Summary


This  research  project  involved  the  development  of  mathematical  models  for 
analysis,  synthesis,  and  simulation  of  large  systems  of  interacting  devices.  The  work 
was  motivated  by  problems  that  may  become  important  in  high  density  VLSI  chips 
with  characteristic  feature  sizes  less  than  1  micron;  it  is  anticipated  that  interactions 
of neighboring devices will play an important role in the determination of circuit
properties. It is hoped that the combination of high device densities and such local
interactions  can  somehow  be  exploited  to  increase  circuit  speed  and  to  reduce  power 
consumption.  To  address  these  issues  from  the  point  of  view  of  system  theory, 
research  was  pursued  in  the  areas  of  nonlinear  and  stochastic  systems  and  into  neural 
network  models. 


Statistical  models  were  developed  to  characterize  various  features  of  the 
dynamic  behavior  of  interacting  systems.  Random  process  models  for  studying  the 
resulting asynchronous modes of operation were investigated. The local interactions
themselves  may  be  modeled  as  stochastic  effects.  The  resulting  behavior  has  been 
investigated  through  the  use  of  various  scaling  limits,  and  by  a  combination  of  other 
analytical  and  simulation  techniques.  Techniques  arising  in  a  variety  of  disciplines 
where  models  of  interaction  have  been  formulated  and  explored  were  considered  and 
adapted for use. Of particular relevance are random field models of spatial
interaction, various results concerning stochastic convergence related to the Central Limit
Theorem, and some basic ideas about computational complexity related to analog
systems. Research into the relations between state space structure and the
input-output function of large systems, using geometric and algebraic methods from
nonlinear system theory, was performed. Distributed computation models, in the form
of artificial neural networks, have been studied because of the great interest in
applications of such systems to a variety of problems in pattern classification and signal
processing.
Technical Results

A first area of significant progress developed from an investigation of
applications of nonlinear system theory to realizability questions for linear filtering. It has
been shown [1] that internally nonlinear systems do not produce a broader class of
linear input-output behaviors than internally linear systems. This means, for
example, that optimal linear filters cannot be realized by “lumped” systems when the
associated covariance functions are not separable; thus it is not possible to exploit
nonlinear behavior to obtain optimal linear filters for more general signals than the
class to which the well-known Kalman filter may be applied. For nonlinear filters
that are described by a Volterra series input-output description, a similar kind of
result was obtained [4], and more general results for linear filtering in colored
(correlated) noise processes were developed using the theory of reproducing kernel Hilbert
spaces (RKHS) [5]. The general importance of these results is that there are
fundamental structural limitations imposed on the input-output behavior of any
well-behaved finite dimensional nonlinear system.

A second area of work was in the formulation of Markov field models for
spatially interacting systems, [2] and [10]. A number of models were proposed, and
efficient simulation procedures were developed. Several conclusions were drawn on the
basis of the work. First, the computational demands of general Markov field models
are extremely severe. This suggests that either approximation methods will probably
play an important role wherever such models are applied, or massively
parallel computers such as the Connection Machine will be needed to handle the
demands. With no such parallel computing engine available for experimental work,
research relied on modest simulation studies, and special attention was given to
studies of possibilities for time-scale decomposition and state aggregation. Intriguing
nonlinear phenomena similar to phase transitions in quantum physics models are
observed in the behavior of locally interacting systems.

After considerable discussion, including the valuable one at the ONR-sponsored
workshop on submicron systems, it appears that physical models for interacting
quantum systems have not yet reached the stage where applications of Markov field
models for large-scale behavioral modeling are realistic. That is to say, the physical
constraints that are necessary to impose on this general form of empirical model are
not yet well enough understood. However, based on the success of stochastic models
in applications such as binary (and gray-scale) image modeling, we continue to be
optimistic that future work on quantum well devices and systems will lead to some
applications for Markov field models.

An attempt to use Markov field models in an application involving VLSI
systems was made: the phenomenon of pattern sensitive faults in dynamic memory
devices was considered [10]. However, the results of this preliminary analysis are not
very encouraging. Several comments about this particular problem can be made.
First, conventional work on testing of memory chips involves the design of a
sufficiently rich class of test input sequences. The use of statistical modeling, and
system identification/parameter estimation methods for determining faulty chips on the
basis of random test input sequences, is a different approach to testing which is not
(yet) widely accepted, although statistical models for phenomena affecting
component lifetimes (e.g. electromigration of metal at contacts) are common in reliability
modeling. Finally, we are unaware of any nonproprietary data on real chips that
could be used in evaluating the stochastic modeling approach to testing in a realistic
setting. This is crucial for such applications where it is desired to detect rare events
(failures) with low error probabilities. The highly developed detection systems used
in radar and sonar applications have evolved from an extensive amount of empirical
and analytical work. It should be expected that applications of statistical models in
VLSI testing will require a similar combination of effort.

The  general  work  done  in  the  area  of  interacting  systems  of  simple  elements 
suggested that various distributed computational models may be useful in signal
processing applications. The same conclusion has been reached by many other
researchers starting from a variety of perspectives. To examine this kind of question in a
specific  setting,  we  undertook  some  research  to  explore  the  structure  of  artificial 
neural  network  models  with  an  eye  towards  isolating  one  or  more  neural  network 
models  that  could  be  implemented  with  a  spatially  interacting  structure  of  the  type 
we  imagine  might  be  relevant  for  future  generation  high  density  VLSI  chips. 

The first major accomplishment was a rederivation of the capacity results for
Hopfield associative memory networks [9], [10]. This problem has attracted
considerable attention, partly because of its implications that, in one particular but important
sense, the number of stored memories that can be achieved on average is
disappointingly small (growing sublinearly with the size of the network). A new analysis was
carried out, using probabilistic methods related to the Central Limit Theorem for
dependent, exchangeable, random variables. This offers an advantage over the
earlier combinatorial approach because it gives a way of exhibiting how constraints on
interconnections and how stochastic neuron models affect asymptotic capacity.
Some further analysis, giving bounds for the number of spurious memories, was also
carried out [11].

The second major thrust in the neural networks area involved a family of
structured, locally interconnected networks based on trellis graphs associated with linear
finite-state systems. Systems of this kind are used to generate convolutional codes
for digital communications. It was found that the combinatorial optimization
problem of finding a shortest path through a segment of a trellis graph could be solved by
a suitably formulated (Grossberg-type) neural network [7], [8], [11]. This provides a
localized, distributed solution to the shortest path problem quite different in spirit
from the dynamic programming solution (Viterbi algorithm) which is widely used in
practice. Furthermore, this use of neural networks seems to offer a particularly nice
method for representing data with distributed redundancy and natural capabilities for
efficient fault tolerant implementation.

Finally, there was some research work that built upon on-going work concerning
analog computation. In particular, [6] describes a framework
for analysis of the complexity of physical systems, e.g. electronic circuits, neural
networks, etc., based on the recognized (or at least widely believed) computational
intractability of certain combinatorial optimization problems, the class of
NP-complete problems. Neural networks and more general nonlinear circuits have
been proposed as a means of solving NP-complete problems like the “Traveling
Salesman Problem” (by Hopfield and Tank, Chua, and others). The results of [6]
indicate that the scaling difficulties encountered by such solutions as the problem size
increases are quite naturally to be expected on the basis of complexity theory.
We ask if analog computers can solve NP-complete problems efficiently. Regarding this as unlikely, we
formulate a strong version of Church’s Thesis: that any analog computer can be simulated efficiently (in
polynomial time) by a digital computer. From this assumption and the assumption that P ≠ NP we can draw
conclusions about the operation of physical devices used for computation.

An NP-complete problem, 3-SAT, is reduced to the problem of checking whether a feasible point is a local
optimum of an optimization problem. A mechanical device is proposed for the solution of this problem. It
encodes variables as shaft angles and uses gears and smooth cams. If we grant Strong Church’s Thesis, that
P ≠ NP, and a certain “Downhill Principle” governing the physical behavior of the machine, we conclude that it
cannot operate successfully while using only polynomial resources.

We  next  prove  Strong  Church’s  Thesis  for  a  class  of  analog  computers  described  by  well-behaved  ordinary 
differential  equations,  which  we  can  take  as  representing  part  of  classical  mechanics. 

We conclude with a comment on the recently discovered connection between spin glasses and combinatorial
optimization.


1.  Introduction 

Analog  devices  have  been  used,  over  the  years,  to  solve  a  variety  of  problems.  Perhaps  most 
widely  known  is  the  Differential  Analyzer  [4,26],  which  has  been  used  to  solve  differential 
equations.  To  mention  some  other  examples,  in  [25]  an  electronic  analog  computer  is  proposed  to 
implement  the  gradient  projection  method  for  linear  programming.  In  [18]  the  problem  of  finding 
a  minimum-length  interconnection  network  between  given  points  in  the  plane  is  solved  with 
movable  and  fixed  pegs  interconnected  by  strings;  a  locally  optimal  solution  is  obtained  by 
pulling  the  strings.  Another  method  is  proposed  there  for  this  problem,  based  on  the  fact  that 
soap  films  form  minimal-tension  surfaces.  Many  other  examples  can  be  found  in  books  such  as 
[14]  and  [16],  including  electrical  and  mechanical  machines  for  solving  simultaneous  linear 
equations  and  differential  equations. 

* This work was supported in part by ONR Grants N00014-83-K-0275 and N00014-83-K-0577, NSF Grant
ECS-8120037, U.S. Army Research-Durham Grant DAAG29-82-K-0095, and DARPA Contract N00014-82-K-0549.
Given the large body of work on the complexity of Turing-machine computation, and the
recent interest in the physical foundations of computation, it seems natural to study the
complexity of analog computation. This paper pursues the following line of reasoning: it is
generally regarded as likely that P ≠ NP, that is, that certain combinatorial problems cannot be solved
efficiently by digital computers. (Here we use the term efficient to mean that the time used by an
‘ideal’ digital computer is bounded by a polynomial function of the size of the task description.
See [9] for discussion of this criterion.) We may ask if such problems can be solved efficiently by
other means, in particular, by machines of a nature different from digital computers. We thus
come to ask if NP-complete problems can be solved efficiently by physical devices that do not
use binary encoding (or, more generally, encoding with any fixed radix). We lump such devices
together under the term analog computer; in what follows we will use the term analog computer
to mean any deterministic physical device that uses a fixed number of physical variables to
represent each problem variable. This description is admittedly vague and certainly
non-mathematical; we mean it to capture the intuitive notion of a ‘non-digital’ computer. (More about this
in the next section.)

We want to emphasize that the question of whether an analog computer can solve an
NP-complete problem ‘efficiently’ is a question about the physical world, while the P = NP
question is a mathematical one. However, mathematical models of various kinds provide a
formalism that is apparently indispensable for the understanding of physical phenomena. An
important connection between the mathematical world of computation and the physical world of
computing hardware was discussed by Church. In his 1936 paper [6] he equated the intuitive
notion of effective calculability with the two equivalent mathematical characterizations of
λ-definability and recursivity. Turing [28] then showed that this notion is equivalent to
computability by what we have come to call a Turing machine, so that the intuitive notion of effective
calculability is now characterized mathematically by ‘Turing-computability’. This is generally
referred to as ‘Church’s Thesis’, or the ‘Church-Turing Thesis’. In our context we express this as
follows:

Church’s  Thesis  (CT):  Any  analog  computer  with  finite  resources  can  be  simulated  by  a  digital 
computer. 

What  we  will  come  to  demand  is  more  than  that:  we  are  interested  in  efficient  computation, 
computation  that  does  not  use  up  resources  that  grow  exponentially  with  the  size  of  the  problem. 
This  requirement  leads  us  to  formulate  what  we  call 

Strong  Church’s  Thesis  (SCT):  Any  finite  analog  computer  can  be  simulated  efficiently  by  a 
digital computer, in the sense that the time required by the digital computer to simulate the
analog  computer  is  bounded  by  a  polynomial  function  of  the  resources  used  by  the  analog 
computer. 

Evidently  we  will  need  to  give  a  characterization  of  analog  computers  and  the  resources  that 
they  use.  This  is  discussed  in  the  next  section.  Following  that,  we  argue  that  certain  numerical 
problems  are  inherently  difficult  (i.e.  not  polynomial)  for  analog  computers,  even  though  they  are 
easy  for  digital  computers. 

Something  like  our  Strong  Church’s  Thesis  was  discussed  recently  by  Feynman  [8]  in 
connection  with  the  problem  of  building  a  (digital)  computer  that  simulates  physics.  He  says: 
“The rule of simulation that I would like to have is that the number of computer elements
required to simulate a large physical system is only to be proportional to the space-time
volume of the physical system. I don’t want to have an explosion.”

We would argue that ‘proportional to’ be replaced by ‘bounded by a polynomial function of’, in
the spirit of modern computational complexity theory.

A class of mechanical devices is proposed in Section 5. Machines in this class can be used to
find local optima for mathematical programming problems. We formalize the physical operation
of these machines by a certain ‘Downhill Principle’. Basically, it states that if, in our class of
mechanical devices, there are feasible ‘downhill’ directions, the state vector describing the
physical system moves in such a direction. We also discuss measuring the resources required by
these machines.

In Section 6 we reduce 3-SAT (the problem of whether a Boolean expression in 3-conjunctive
normal form has a satisfying truth assignment) to the problem of checking whether a given
feasible point is a local optimum of a certain mathematical programming problem. This shows
that merely checking for local optimality is NP-hard.
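
For readers unfamiliar with 3-SAT, the following is a minimal illustrative sketch, not taken from the paper: a brute-force satisfiability check over all truth assignments. The clause encoding (signed integers for literals) and the function name `satisfiable` are assumptions made for this example. The loop visits up to 2^n assignments, which is the kind of exponential growth that the efficiency criterion above is meant to exclude.

```python
from itertools import product

def satisfiable(clauses, n_vars):
    """Brute-force 3-SAT check (hypothetical helper, for illustration only).

    clauses: list of 3-tuples of nonzero ints; literal k means x_k is true,
    literal -k means x_k is false. Variables are numbered 1..n_vars.
    """
    for assignment in product((False, True), repeat=n_vars):
        # A clause is satisfied if at least one of its literals is true.
        if all(any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

# One satisfiable instance and one unsatisfiable instance over three variables.
easy = [(1, 2, 3), (-1, -2, -3)]
all_sign_patterns = [(a * 1, b * 2, c * 3)
                     for a, b, c in product((1, -1), repeat=3)]
```

Here `easy` is satisfied by, for example, x_1 true and x_2 false, while `all_sign_patterns` contains every sign pattern on three variables, so each assignment falsifies exactly one clause.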

In Section 7 a mechanical device in the class mentioned above is proposed for the solution of
3-SAT. Naturally, the efficient operation of this machine is highly suspect. Be careful to notice
that the operation of any machine in practice is a physics question, not a question susceptible of
ultimate mathematical demonstration. Our analysis must necessarily be based on an idealized
mathematical model for the machine. However, we can take the likelihood of P ≠ NP, plus the
likelihood of Strong Church’s Thesis, as evidence that in fact such a machine cannot operate with
polynomially bounded resources, whatever the particular laws of physics happen to be.

The paradigm that emerges from this line of reasoning is then the following:

If a strongly NP-complete problem can be solved by an analog computer, and if P ≠ NP, and
if Strong Church’s Thesis is true, then the analog computer cannot operate successfully with
polynomial resources.

We  will  then  prove  a  restricted  form  of  Strong  Church’s  Thesis,  for  analog  computers  governed 
by  well-behaved  differential  equations.  This  suggests  that  any  interesting  analog  computer  should 
rely on some strongly nonlinear behavior, perhaps arising from quantum mechanical
mechanisms; however, the problem of establishing Strong Church’s Thesis (or even the Weak Thesis) in
the  case  of  quantum-mechanical  or  probabilistic  laws  is  an  open  problem. 


2.  Some  terminology 

We  know  what  a  digital  computer  is;  Turing  has  laid  out  a  model  for  what  a  well-defined 
digital  computation  must  be;  it  uses  a  finite  set  of  symbols  (without  loss  of  generality  {0,1})  to 
store information, it can be in only one of a finite set of states, and it operates by a finite set of
rules  for  moving  from  state  to  state.  Its  memory  tape  is  not  bounded  in  length  a  priori,  but  only  a 
finite  amount  of  tape  can  be  used  for  any  one  computation.  What  is  fundamental  about  the  idea 
of  a  Turing  Machine  and  digital  computation  in  general,  is  that  there  is  a  perfect  correspondence 
between  the  mathematical  model  and  what  happens  in  a  reasonable  working  machine.  Being 
definitely  in  one  of  two  states  is  easily  arranged  in  practice,  and  the  operation  of  real  digital 
computers  can  be  (and  usually  is)  made  very  reliable. 
Abstract

We have developed a neural network which consists of cooperatively
interconnected Grossberg on-center off-surround subnets and which can be used to
optimize a function related to the log likelihood function for decoding
convolutional codes or more general FIR signal deconvolution problems. Connections in
the  network  are  confined  to  neighboring  subnets,  and  it  is  representative  of  the 
types  of  networks  which  lend  themselves  to  VLSI  implementation.  Analytical  and 
experimental  results  for  convergence  and  stability  of  the  network  have  been  found. 

The  structure  of  the  network  can  be  used  for  distributed  representation  of  data 
items  while  allowing  for  fault  tolerance  and  replacement  of  faulty  units. 

1  Introduction 

In  order  to  study  the  behavior  of  locally  interconnected  networks,  we  have  focused 
on  a  class  of  “trellis-structured”  networks  which  are  similar  in  structure  to  multilayer 
networks  [5]  but  use  symmetric  connections  and  allow  every  neuron  to  be  an  output. 
We  are  studying  such  locally  interconnected  neural  networks  because  they  have  the 
potential  to  be  of  great  practical  interest.  Globally  interconnected  networks,  e.g., 
Hopfield  networks  [3],  are  difficult  to  implement  in  VLSI  because  they  require  many 
long  wires.  Locally  connected  networks,  however,  can  be  designed  to  use  fewer  and 
shorter  wires. 

In this paper, we will describe a subclass of trellis-structured networks which
optimize a function that, near the global minimum, has the form of the log likelihood
function for decoding convolutional codes or more general finite impulse response
signals. Convolutional codes, defined in section 2, provide an alternative representation
scheme which can avoid the need for global connections. Our network, described in
section 3, can perform maximum likelihood sequence estimation of convolutional coded
sequences in the presence of noise. The performance of the system is optimized for low
error rates.

The specific application for this network was inspired by a signal decomposition
network described by Hopfield and Tank [6]. However, in our network, there is an
emphasis on local interconnections and a more complex neural model, the Grossberg
on-center off-surround network [2], is used. A modified form of the Grossberg model
is described in section 4. Section 5 presents the main theoretical results of this paper.
Although the deconvolution network is simply a set of cooperatively interconnected
on-center off-surround subnetworks, and absolute stability for the individual
subnetworks has been proven [1], the cooperative interconnections between these subnets
make a similar proof difficult and unlikely. We have been able, however, to prove
equiasymptotic stability in the Lyapunov sense for this network given that the gain
of the nonlinearity in each neuron is large. Section 6 will describe simulations of the
network that were done to confirm the stability results.

2  Convolutional  Codes  and  MLSE 

In an error correcting code, an input sequence is transformed from a $b$-dimensional
input space to an $M$-dimensional output space, where $M > b$ for error correction
and/or detection. In general, for the $b$-bit input vector $U = (u_1, \ldots, u_b)$ and the
$M$-bit output vector $V = (v_1, \ldots, v_M)$, we can write $V = F(u_1, \ldots, u_b)$. A convolutional
code, however, is designed so that relatively short subsequences of the input vector
are used to determine subsequences of the output vector. For example, for a rate 1/3
convolutional code (where $M \approx 3b$), with input subsequences of length 3, we can write
the output, $V = (v_1, \ldots, v_b)$ for $v_i = (v_{i,1}, v_{i,2}, v_{i,3})$, of the encoder as a convolution
of the input vector $U = (u_1, \ldots, u_b, 0, 0)$ and three generator sequences

$$g_0 = (1\,1\,1), \qquad g_1 = (1\,1\,0), \qquad g_2 = (0\,1\,1).$$

This convolution can be written, using modulo-2 addition, as

$$v_i = \sum_{k=\max(1,\,i-2)}^{i} u_k\, g_{i-k} \qquad (1)$$

In this example, each 3-bit output subsequence, $v_i$, of $V$ depends only on three
bits of the input vector, i.e., $v_i = f(u_{i-2}, u_{i-1}, u_i)$. In general, for a rate 1/n code, the
constraint length, $K$, is the number of bits of the input vector that uniquely determine
each n-bit output subsequence. In the absence of noise, any subsequences in the
input vector separated by more than $K$ bits (i.e., that do not overlap) will produce
subsequences in the output vector that are independent of each other.
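
As an illustration of the encoding rule in equation 1, here is a minimal sketch of the rate 1/3, K = 3 encoder using the three generator sequences given above. The function name `encode` and the 0-based indexing are choices made for this example, not notation from the paper.

```python
# Generator sequences from the text: g_0 = (1 1 1), g_1 = (1 1 0), g_2 = (0 1 1).
G = [(1, 1, 1), (1, 1, 0), (0, 1, 1)]

def encode(bits):
    """Encode a list of 0/1 input bits; returns a list of 3-bit tuples v_i.

    Two zeros are appended to the input, as in U = (u_1, ..., u_b, 0, 0).
    """
    u = list(bits) + [0, 0]
    out = []
    for i in range(len(u)):
        v = [0, 0, 0]
        # v_i = sum_{k = max(1, i-2)}^{i} u_k g_{i-k} (mod 2).
        # Indices here are 0-based, so k runs from max(0, i-2) to i.
        for k in range(max(0, i - 2), i + 1):
            if u[k]:
                v = [(a + g) % 2 for a, g in zip(v, G[i - k])]
        out.append(tuple(v))
    return out

# A single 1 followed by zeros reproduces the generator sequences in order.
impulse_response = encode([1, 0, 0])
```

Encoding a single 1 followed by zeros returns the three generators in turn, followed by all-zero subsequences, which shows directly that each output subsequence depends on only the three most recent input bits.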

If we view a convolutional code as a special case of block coding, this rate 1/3,
$K = 3$ code converts a $b$-bit input word into a codeword of length $3(b + 2)$ where
the 2 is added by introducing two zeros at the end of every input to “zero-out” the
code. Equivalently, the coder can be viewed as embedding $2^b$ memories into a
$3(b + 2)$-dimensional space. The minimum distance between valid memories or codewords in
this space is the free distance of the code, which in this example is 7. This implies
that the code is able to correct a minimum of three errors in the received signal.
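
The free distance quoted above can be checked numerically for short inputs. The sketch below is an illustration only: it enumerates every nonzero input of a small, arbitrarily chosen length b = 4 and takes the minimum codeword weight, which for a linear code equals the minimum distance between codewords. The helper names `encode_flat` and `min_codeword_weight` are hypothetical.

```python
from itertools import product

G = [(1, 1, 1), (1, 1, 0), (0, 1, 1)]  # generator sequences g_0, g_1, g_2

def encode_flat(bits):
    """Return the codeword as a flat list of 3*(b+2) bits."""
    u = list(bits) + [0, 0]            # two tail zeros "zero-out" the code
    code = []
    for i in range(len(u)):
        for l in range(3):
            bit = 0
            for k in range(max(0, i - 2), i + 1):
                bit ^= u[k] & G[i - k][l]   # modulo-2 accumulation
            code.append(bit)
    return code

def min_codeword_weight(b):
    """Smallest Hamming weight over all nonzero b-bit inputs."""
    return min(sum(encode_flat(u))
               for u in product((0, 1), repeat=b) if any(u))

d_min = min_codeword_weight(4)
t = (d_min - 1) // 2   # number of errors guaranteed correctable
```

With these generators the enumeration gives a minimum weight of 7 and hence three guaranteed correctable errors, in agreement with the text.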

For a convolutional code with constraint length $K$, the encoder can be viewed as
a finite state machine whose state at time $i$ is determined by the $K - 1$ input bits,
$u_{i-K+1}, \ldots, u_{i-1}$. The encoder can also be represented as a trellis graph such as the one
shown in figure 1 for a $K = 3$, rate 1/3 code. In this example, since the constraint
length is three, the two bits $u_{i-2}$ and $u_{i-1}$ determine which of four possible states the
encoder is in at time $i$. In the trellis graph, there is a set of four nodes arranged in a
vertical column, which we call a stage, for each time step $i$. Each node is labeled with
the associated values of $u_{i-2}$ and $u_{i-1}$. In general, for a rate 1/n code, each stage of
the trellis graph contains $2^{K-1}$ nodes, representing an equal number of possible states.
A trellis graph which contains 5 stages therefore fully describes the operation of the
encoder for time steps 1 through 5. The graph is read from left to right and the upper
edge leaving the right side of a node in stage $i$ is followed if $u_i$ is a zero; the lower edge
if $u_i$ is a one. The label on the edge determined by $u_i$ is $v_i$, the output of the encoder
given by equation 1 for the subsequence $u_i$.

Figure 1: Part of the trellis-code representation for a rate 1/3, $K = 3$ convolutional
code (stages $i-2$ through $i+2$ shown).
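
The trellis description can be made concrete by enumerating the edges of one stage. This is a sketch under the assumptions stated in the text (K = 3, rate 1/3, the three generator sequences of section 2); the dictionary layout and the name `trellis_edges` are illustrative choices.

```python
G = [(1, 1, 1), (1, 1, 0), (0, 1, 1)]  # generator sequences g_0, g_1, g_2

def trellis_edges():
    """Return {(state, u_i): (next_state, label)} for one trellis stage.

    A state is the pair (u_{i-2}, u_{i-1}); the label is the 3-bit output v_i.
    """
    edges = {}
    for u2 in (0, 1):              # u_{i-2}
        for u1 in (0, 1):          # u_{i-1}
            for u0 in (0, 1):      # u_i, the current input bit
                label = tuple(
                    (u0 * G[0][l] + u1 * G[1][l] + u2 * G[2][l]) % 2
                    for l in range(3)
                )
                edges[((u2, u1), u0)] = ((u1, u0), label)
    return edges

edges = trellis_edges()
```

Each of the four states has two outgoing edges, one per value of the input bit, so a stage has eight edges in total.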

Decoding a noisy sequence that is the output of a convolutional coder plus noise
is typically done using a maximum likelihood sequence estimation (MLSE) decoder
which is designed to accept as input a possibly noisy convolutional coded sequence, $R$,
and produce as output the maximum likelihood estimate, $\hat{V}$, of the original sequence,
$V$. If the set of possible $n(b+2)$-bit encoder output vectors is $\{X_m : m = 1, \ldots, 2^b\}$,
and $X_{m,i}$ is the $i$th n-bit subsequence of $X_m$ and $r_i$ is the $i$th n-bit subsequence of $R$,
then

$$\hat{V} = \arg\max_{X_m} \prod_{i=1}^{b} P(r_i \mid X_{m,i}) \qquad (2)$$

That is, the decoder chooses the $X_m$ that maximizes the conditional probability, given
$X_m$, of the received sequence.

A binary symmetric channel (BSC) is an often used transmission channel model in
which the decoder produces output sequences formed from an alphabet containing two
symbols and it is assumed that the probability of either of the symbols being affected
by noise so that the other symbol is received is the same for both symbols. In the
case of a BSC, the log of the conditional probability, $P(r_i \mid X_{m,i})$, is a linear function
of the Hamming distance between $r_i$ and $X_{m,i}$ so that maximizing the right side of
equation 2 is equivalent to choosing the $X_m$ that has the most bits in common with
$R$. Therefore, equation 2 can be rewritten as

$$\hat{V} = \arg\max_{X_m} \sum_{i=1}^{b} \sum_{l=1}^{n} I_{r_{i,l}}(x_{m,i,l}) \qquad (3)$$

where $x_{m,i,l}$ is the $l$th bit of the $i$th subsequence of $X_m$ and $I_a(b)$ is the indicator
function: $I_a(b) = 1$ if and only if $a$ equals $b$.
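
The bit-agreement criterion of equation 3 can be sketched as an exhaustive search over all codewords. This is an illustration only: it assumes the rate 1/3, K = 3 code of section 2, counts agreements over the whole codeword including the two tail subsequences, and uses hypothetical helper names `encode_flat` and `mlse_brute_force`.

```python
from itertools import product

G = [(1, 1, 1), (1, 1, 0), (0, 1, 1)]  # generator sequences g_0, g_1, g_2

def encode_flat(bits):
    """Return the codeword for `bits` as a flat list of 3*(b+2) bits."""
    u = list(bits) + [0, 0]
    code = []
    for i in range(len(u)):
        for l in range(3):
            bit = 0
            for k in range(max(0, i - 2), i + 1):
                bit ^= u[k] & G[i - k][l]
            code.append(bit)
    return code

def mlse_brute_force(received, b):
    """Return the b-bit input whose codeword shares the most bits with `received`."""
    def agreement(u):
        # Sum of indicator functions: 1 for every position where the bits match.
        return sum(x == r for x, r in zip(encode_flat(u), received))
    return max(product((0, 1), repeat=b), key=agreement)

# Send a 4-bit message and flip three channel bits (positions chosen arbitrarily).
sent = encode_flat([1, 0, 1, 1])
received = list(sent)
for pos in (0, 7, 13):
    received[pos] ^= 1
decoded = mlse_brute_force(received, 4)
```

Because the search visits all 2^b inputs, its cost doubles with every added input bit, which is the expense that motivates exploiting the trellis structure.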

For the general case, maximum likelihood sequence estimation is very expensive
since the number of possible input sequences is exponential in $b$. The Viterbi
algorithm [7], fortunately, is able to take advantage of the structure of convolutional codes
and their trellis graph representations to reduce the complexity of the decoder so that
it is only exponential in $K$ (in general $K < b$). An optimum version of the Viterbi
algorithm examines all $b$ stages in the trellis graph, but a more practical and very nearly
optimum version typically examines approximately $5K$ stages, beginning at stage $i$,
before making a decision about $u_i$.
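
For comparison with the exhaustive search, here is a minimal sketch of the full-length ("optimum") form of the Viterbi algorithm for the same rate 1/3, K = 3 code over a BSC, using Hamming distance as the branch metric. It is a textbook-style illustration under those assumptions rather than the paper's own implementation, and it does not include the truncated 5K-stage decision rule; the function names are hypothetical.

```python
G = [(1, 1, 1), (1, 1, 0), (0, 1, 1)]  # generator sequences g_0, g_1, g_2
STATES = [(0, 0), (0, 1), (1, 0), (1, 1)]   # state = (u_{i-2}, u_{i-1})

def branch_label(state, u0):
    """3-bit encoder output for input bit u0 when leaving `state`."""
    u2, u1 = state
    return tuple((u0 * G[0][l] + u1 * G[1][l] + u2 * G[2][l]) % 2
                 for l in range(3))

def encode_flat(bits):
    """Walk the trellis from state (0, 0), appending two tail zeros."""
    state, out = (0, 0), []
    for u0 in list(bits) + [0, 0]:
        out.extend(branch_label(state, u0))
        state = (state[1], u0)
    return out

def viterbi_decode(received, b):
    """Decode 3*(b+2) received bits; returns the b estimated input bits."""
    inf = float("inf")
    cost = {s: (0 if s == (0, 0) else inf) for s in STATES}
    paths = {s: [] for s in STATES}
    for i in range(b + 2):
        r = received[3 * i: 3 * i + 3]
        inputs = (0, 1) if i < b else (0,)      # tail bits are forced to zero
        new_cost = {s: inf for s in STATES}
        new_paths = {s: [] for s in STATES}
        for s in STATES:
            if cost[s] == inf:
                continue                        # state not yet reachable
            for u0 in inputs:
                nxt = (s[1], u0)
                d = sum(x != y for x, y in zip(branch_label(s, u0), r))
                if cost[s] + d < new_cost[nxt]:  # keep the cheaper survivor
                    new_cost[nxt] = cost[s] + d
                    new_paths[nxt] = paths[s] + [u0]
        cost, paths = new_cost, new_paths
    return paths[(0, 0)][:b]                    # tail zeros end in state (0, 0)

message = [1, 0, 1, 1, 0, 1]
noisy = encode_flat(message)
for pos in (2, 9, 20):                          # three arbitrary bit errors
    noisy[pos] ^= 1
estimate = viterbi_decode(noisy, len(message))
```

The work per stage is fixed by the number of states, $2^{K-1}$, so the total cost grows linearly in b and exponentially only in K.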


3  A  Network  for  MLSE  Decoding 


The structure of the network that we have defined strongly reflects the structure of a
trellis graph. The network usually consists of $5K$ subnetworks, each containing $2^{K-1}$
neurons. Each subnetwork corresponds to a stage in the trellis graph and each neuron
to a state. Each stage is implemented as an “on-center off-surround” competitive
network [2], described in more detail in the next section, which produces as output a
contrast enhanced version of the input. This contrast enhancement creates a “winner
take all” situation in which, under normal circumstances, only one neuron in each
stage, the neuron receiving the input with greatest magnitude, will be on. The
activation pattern of the network after it reaches equilibrium indicates the decoded
sequence as a sequence of “on” neurons in the network. If the $j$-th neuron in subnet $i$,
$N_{i,j}$, is on, then the node representing state $j$ in stage $i$ lies on the network’s estimate
of the most likely path.

For a rate 1/n code, there is a symmetric cooperative connection between neurons
$M_{i,j}$ and $M_{i+1,k}$ if there is an edge between the corresponding nodes in the trellis
graph. If $(x_{i,j,k,1}, \ldots, x_{i,j,k,n})$ are the encoder output bits for the transition between
these two nodes and $(r_{i,1}, \ldots, r_{i,n})$ are the received bits, then the connection weight
for the symmetric cooperative connection between $M_{i,j}$ and $M_{i+1,k}$ is

$$m_{i,j,k} = \frac{1}{n} \sum_{l=1}^{n} I_{r_{i,l}}(x_{i,j,k,l}) \qquad (4)$$

If there is no edge between the nodes, then $m_{i,j,k} = 0$.
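A small sketch of equation 4 may help: each weight is the fraction of encoder output bits on a trellis edge that agree with the received bits for that transition, and node pairs without an edge keep weight zero. The function names and the two-state trellis section below are hypothetical and serve only as an illustration.

```python
def edge_weight(expected_bits, received_bits):
    """Equation 4: fraction of the n edge bits that match the received bits."""
    n = len(expected_bits)
    if n == 0 or n != len(received_bits):
        raise ValueError("bit groups must be non-empty and of equal length")
    return sum(1 for x, r in zip(expected_bits, received_bits) if x == r) / n


def stage_weights(edge_labels, received_bits, num_states):
    """Weights m[j][k] for one transition; pairs with no trellis edge stay 0."""
    m = [[0.0] * num_states for _ in range(num_states)]
    for (j, k), bits in edge_labels.items():
        m[j][k] = edge_weight(bits, received_bits)
    return m


# Hypothetical two-state trellis section: (from_state, to_state) -> edge bits.
labels = {(0, 0): (0, 0, 0), (0, 1): (1, 1, 1), (1, 0): (0, 1, 1)}
m = stage_weights(labels, received_bits=(0, 1, 0), num_states=2)
```

Because each weight depends only on one edge label and one received subsequence, the computation is local to a connection.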

Intuitively, it is easiest to understand the action of the entire network by examining
one stage. Consider the nodes in stage $i$ of the trellis graph and assume that
the conditional probabilities of the nodes in stages $i - 1$ and $i + 1$ are known. (All
probabilities are conditional on the received sequence.) Then the conditional probability
of each node in stage $i$ is simply the sum of the probabilities of each node in
stages $i - 1$ and $i + 1$ weighted by the conditional transition probabilities. If we look
at stage $i$ in the network, and let the outputs of the neighboring stages $i - 1$ and
$i + 1$ be fixed with the output of each neuron corresponding to the “likelihood” of
the corresponding state at that stage, then the final outputs of the neurons $M_{i,j}$ will
correspond to the “likelihood” of each of the corresponding states. At equilibrium, the
neuron corresponding to the most likely state will have the largest output.

4  The  Neural  Model 

The “on-center off-surround” network [2] is used to model each stage in our network.
This model allows the output of each neuron to take on a range of values, in this
case between zero and one, and is designed to support contrast enhancement and
competition between neurons. The model also guarantees that the final output of
each neuron is a function of the relative intensity of its input as a fraction of the total
input provided to the network.


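Using the “on-center off-surround” model for each stage and the interconnection
weights, $m_{i,j,k}$, defined in equation 4, the differential equation that governs the
instantaneous activity of the neurons in our deconvolution network with $S$ stages and
$N$ states in each stage can be written as

$$\begin{aligned}
\dot{u}_{i,j} = {} & -A u_{i,j} + (B - u_{i,j}) \Big( f(u_{i,j}) + \sum_{k=1}^{N} \big[ m_{i-1,k,j} f(u_{i-1,k}) + m_{i,j,k} f(u_{i+1,k}) \big] \Big) \\
& - (C + u_{i,j}) \sum_{k \neq j}^{N} \Big( f(u_{i,k}) + \sum_{l=1}^{N} \big[ m_{i-1,l,k} f(u_{i-1,l}) + m_{i,k,l} f(u_{i+1,l}) \big] \Big) \qquad (5)
\end{aligned}$$

where $f(x) = (1 + e^{-\lambda x})^{-1}$, $\lambda$ is the gain of the nonlinearity, and $A$, $B$, and $C$ are
constants.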

For  the  analysis  to  be  presented  in  section  5,  we  note  that  equation  5  can  be 
rewritten  more  compactly  in  a  notation  that  is  similar  to  the  equation  for  additive 
analog  neurons  given  in  [4]: 

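$$\dot{u}_{i,j} = -A u_{i,j} - \sum_{k=1}^{S} \sum_{l=1}^{N} \big( u_{i,j} S_{i,j,k,l} - T_{i,j,k,l} \big) f(u_{k,l}) \qquad (6)$$

where, for $1 \le l \le N$,

$$\begin{aligned}
S_{i,j,i,l} &= 1, & T_{i,j,i,l} &= \begin{cases} B, & l = j \\ -C, & l \neq j \end{cases} \\
S_{i,j,i-1,l} &= \sum_{k=1}^{N} m_{i-1,l,k}, & T_{i,j,i-1,l} &= (B + C)\, m_{i-1,l,j} - C \sum_{k=1}^{N} m_{i-1,l,k} \\
S_{i,j,i+1,l} &= \sum_{k=1}^{N} m_{i,k,l}, & T_{i,j,i+1,l} &= (B + C)\, m_{i,j,l} - C \sum_{k=1}^{N} m_{i,k,l}
\end{aligned} \qquad (7)$$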


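To eliminate the need for global interconnections within a stage, we can add two
summing elements to calculate

$$X_i = \sum_{j=1}^{N} f(u_{i,j}) \quad \text{and} \quad J_i = \sum_{j=1}^{N} I_{i,j}, \qquad I_{i,j} = \sum_{k=1}^{N} \big[ m_{i-1,k,j} f(u_{i-1,k}) + m_{i,j,k} f(u_{i+1,k}) \big] \qquad (8)$$

Using these two sums allows us to rewrite equation 5 as

$$\dot{u}_{i,j} = -A u_{i,j} + (B + C)\big( f(u_{i,j}) + I_{i,j} \big) - (u_{i,j} + C)(X_i + J_i) \qquad (9)$$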

This  form  provides  a  more  compact  design  for  the  network  that  is  particularly  suited 
to  implementation  as  a  digital  filter  or  for  use  in  simulations  since  it  greatly  reduces 
the  calculations  required. 
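Expanding the products in equation 5 collects the stage-wide sums into the form $-A u_{i,j} + (B + C)(f(u_{i,j}) + I_{i,j}) - (u_{i,j} + C)(X_i + J_i)$. The following sketch checks this numerically for one stage by evaluating the right side both ways. The neighbor-stage contributions are lumped into an input vector `I`; the function names, the random test values, and the parameter choices are assumptions for illustration.

```python
import math
import random


def f(u, lam=10.0):
    """Sigmoid nonlinearity with gain lam."""
    return 1.0 / (1.0 + math.exp(-lam * u))


def rhs_direct(u, I, j, A, B, C):
    """Right side for neuron j with the off-surround sum written out over k != j."""
    excite = f(u[j]) + I[j]
    inhibit = sum(f(u[k]) + I[k] for k in range(len(u)) if k != j)
    return -A * u[j] + (B - u[j]) * excite - (C + u[j]) * inhibit


def rhs_compact(u, I, j, A, B, C):
    """Same quantity computed from the two stage-wide sums X and J."""
    X = sum(f(x) for x in u)
    J = sum(I)
    return -A * u[j] + (B + C) * (f(u[j]) + I[j]) - (u[j] + C) * (X + J)


random.seed(0)
u = [random.uniform(-0.5, 0.5) for _ in range(4)]
I = [random.uniform(0.0, 1.0) for _ in range(4)]
gaps = [
    abs(rhs_direct(u, I, j, 1.0, 1.0, 0.75) - rhs_compact(u, I, j, 1.0, 1.0, 0.75))
    for j in range(4)
]
```

The compact form needs the two sums only once per stage, so the cost per neuron no longer grows with the number of states in the stage.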


5  Stability  of  the  Network 

The end of section 3 described the desired operation of a single stage, given that the
outputs of the neighboring stages are fixed. It is possible to show that in this situation
a single stage is stable. To do this, fix $f(u_{k,l})$ for $k \in \{i - 1, i + 1\}$ so that equation 6
can be written in the form originally proposed by Grossberg [2]:

$$\dot{u}_{i,j} = -A u_{i,j} + (B - u_{i,j}) \big( I_{i,j} + f(u_{i,j}) \big) - (u_{i,j} + C) \Big[ \sum_{k \neq j}^{N} I_{i,k} + \sum_{k \neq j}^{N} f(u_{i,k}) \Big] \qquad (10)$$

where $I_{i,j} = \sum_{k=1}^{N} \big[ m_{i-1,k,j} f(u_{i-1,k}) + m_{i,j,k} f(u_{i+1,k}) \big]$.

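Equation 10 is a special case of the more general nonlinear system

$$\dot{x}_i = a_i(x_i) \Big[ b_i(x_i) - \sum_{k=1}^{n} c_{i,k}\, d_k(x_k) \Big] \qquad (11)$$

where: (1) $a_i(x_i)$ is continuous and $a_i(x_i) > 0$ for $x_i > 0$; (2) $b_i(x_i)$ is continuous
for $x_i > 0$; (3) $c_{i,k} = c_{k,i}$; and (4) $d_i'(x_i) \ge 0$ for all $x_i \in (-\infty, \infty)$. Cohen and
Grossberg [1] showed that such a system has a global Lyapunov function:

$$V(x) = -\sum_{i=1}^{n} \int_{0}^{x_i} b_i(\xi_i)\, d_i'(\xi_i)\, d\xi_i + \frac{1}{2} \sum_{j=1}^{n} \sum_{k=1}^{n} c_{j,k}\, d_j(x_j)\, d_k(x_k) \qquad (12)$$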

and  that,  therefore,  such  a  system  is  equiasymptotically  stable  for  all  constants  and 
functions  satisfying  the  four  constraints  above.  In  our  case,  this  means  that  a  single 
stage  has  the  desired  behavior  when  the  neighboring  stages  are  fixed.  If  we  take  the 
output of each neuron to correspond to the likelihood of the corresponding state then,
if  the  two  neighboring  stages  are  fixed,  stage  i  will  converge  to  an  equilibrium  point 
where  the  neuron  receiving  the  largest  input  will  be  on  and  the  others  will  be  off,  just 
as  it  should  according  to  section  2. 
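The single-stage behavior can be illustrated with a short Euler integration of one on-center off-surround stage whose external inputs are held fixed. This is only a sketch: the input values, the zero initial activities, and the function names are assumptions, while the parameter values follow those quoted for the simulations in this report.

```python
import math


def f(u, lam=10.0):
    """Sigmoid nonlinearity with gain lam."""
    return 1.0 / (1.0 + math.exp(-lam * u))


def relax_stage(inputs, A=1.0, B=1.0, C=0.75, lam=10.0, dt=0.02, steps=500):
    """Euler-integrate one competitive stage with fixed external inputs."""
    u = [0.0] * len(inputs)          # assumed initial activities
    J = sum(inputs)                  # total external input to the stage
    for _ in range(steps):
        out = [f(x, lam) for x in u]
        X = sum(out)                 # total output of the stage
        u = [
            x + dt * (-A * x + (B + C) * (o + i) - (x + C) * (X + J))
            for x, o, i in zip(u, out, inputs)
        ]
    return [f(x, lam) for x in u]


outputs = relax_stage([0.9, 0.3, 0.2, 0.1])
winner = max(range(len(outputs)), key=outputs.__getitem__)
```

Starting from equal activities, the neuron with the largest input receives the largest net drive at every step, so it stays ahead of the others throughout the integration.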

It  does  not  seem  possible  to  use  the  Cohen-Grossberg  stability  proof  for  the  entire 
system  in  equation  5.  In  fact,  Cohen  and  Grossberg  note  that  networks  which  allow 
cooperative interactions define systems for which no stability proof exists [1].

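Since an exact stability proof seems unlikely, we have instead shown that in the
limit as the gain, $\lambda$, of the nonlinearity gets large the system is asymptotically stable.
Using the notation in [4], define $V_i = f(u_i)$ and a normalized nonlinearity $\bar{f}(\cdot)$ such
that $\bar{f}^{-1}(V_i) = \lambda u_i$. Then we can define an energy function for the deconvolution
network to be

$$E = -\frac{1}{2} \sum_{i,j,k,l} T_{i,j,k,l} V_{i,j} V_{k,l} - \frac{1}{\lambda} \sum_{i,j} \Big( -A - \sum_{k,l} S_{i,j,k,l} V_{k,l} \Big) \int_{1/2}^{V_{i,j}} \bar{f}^{-1}(\zeta)\, d\zeta \qquad (13)$$

The time derivative of $E$ is

$$\dot{E} = -\sum_{i,j} \frac{dV_{i,j}}{dt} \Big( -A u_{i,j} - u_{i,j} \sum_{k,l} S_{i,j,k,l} V_{k,l} + \sum_{k,l} T_{i,j,k,l} V_{k,l} - \frac{1}{\lambda} \sum_{k,l} S_{k,l,i,j} \int_{1/2}^{V_{k,l}} \bar{f}^{-1}(\zeta)\, d\zeta \Big) \qquad (14)$$

It is difficult to prove that $\dot{E}$ is nonpositive because of the last term in the parentheses.
However, for large gain, this term can be shown to have a negligible effect on the
derivative.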

It can be shown that for $f(u) = (1 + e^{-\lambda u})^{-1}$, $\int_{1/2}^{V} \bar{f}^{-1}(\zeta)\, d\zeta$ is bounded above
by $\log(2)$. In this deconvolution network, there are no connections between neurons
unless they are in the same or neighboring stages, i.e., $S_{i,j,k,l} = 0$ for $|i - k| > 1$, and
$l$ is restricted so that $0 < l \le S$, so there are no more than $3S$ non-zero terms in the
problematical summation. Therefore, we can write that


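Then, in the limit as $\lambda \to \infty$, the terms in parentheses in equation 14 converge to $\dot{u}_{i,j}$
in equation 6, so that $\lim_{\lambda \to \infty} \dot{E} = -\sum_{i,j} \frac{dV_{i,j}}{dt}\, \dot{u}_{i,j}$. Using the chain rule, we can rewrite this
as

$$\lim_{\lambda \to \infty} \dot{E} = -\sum_{i,j} \frac{d f^{-1}(V_{i,j})}{dV_{i,j}} \Big( \frac{dV_{i,j}}{dt} \Big)^{2}$$

It can also be shown that, if $f(\cdot)$ is a monotonically increasing function, then
$\frac{d f^{-1}(V_i)}{dV_i} > 0$ for all $V_i$. This implies that for all $u = (u_{1,1}, \ldots, u_{N,S})$, $\lim_{\lambda \to \infty} \dot{E} \le 0$,
and, therefore, for large gains, $E$ as defined in equation 13 is a Lyapunov function for
the system described by equation 5 and the network is equiasymptotically stable.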
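The $\log(2)$ bound quoted in this argument can be checked numerically. The sketch below assumes the unit-gain sigmoid $f(u) = (1 + e^{-u})^{-1}$, whose inverse is $\ln(\zeta / (1 - \zeta))$, and a lower integration limit of 1/2 (the point where the inverse is zero); both are assumptions made for this illustration. Under them the integral has the closed form $V \ln V + (1 - V)\ln(1 - V) + \ln 2$.

```python
import math


def inv_sigmoid(v):
    """Inverse of the unit-gain sigmoid f(u) = 1 / (1 + exp(-u))."""
    return math.log(v / (1.0 - v))


def integral_numeric(v, steps=20000):
    """Midpoint-rule estimate of the integral of inv_sigmoid from 1/2 to v."""
    h = (v - 0.5) / steps
    return h * sum(inv_sigmoid(0.5 + (n + 0.5) * h) for n in range(steps))


def integral_closed(v):
    """Closed form of the same integral: v ln v + (1 - v) ln(1 - v) + ln 2."""
    return v * math.log(v) + (1.0 - v) * math.log(1.0 - v) + math.log(2.0)


samples = [0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99]
values = [integral_closed(v) for v in samples]
```

The closed form is zero at $V = 1/2$ and approaches $\ln 2$ as $V$ tends to 0 or 1, which is where the bound comes from.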

If we apply a similar asymptotic argument to the energy function, equation 13
reduces to

$$\lim_{\lambda \to \infty} E = -\frac{1}{2} \sum_{i,j,k,l} T_{i,j,k,l} V_{i,j} V_{k,l} \qquad (15)$$

which is the Lyapunov function for a network of discontinuous on-off neurons with
interconnection matrix $T$. For the binary neuron case, it is fairly straightforward to
show  that  the  energy  function  has  minima  at  the  desired  decoder  outputs  if  we  assume 
that  only  one  neuron  in  each  stage  may  be  on  and  that  B  and  C  are  appropriately 
chosen  to  favor  this.  However,  since  there  are  O(S^N)  terms  in  the  disturbance 
summation  in  equation  15,  convergence  in  this  case  is  not  as  fast  as  for  the  derivative 
of the energy function in equation 13, which has only $O(S)$ terms in the summation.


6  Simulation  Results 

The simulations presented in this section are for the rate 1/3, $K = 3$ convolutional code
illustrated in figure 1. Since this code has a constraint length of 3, there are 4 possible
states in each stage. Because an MLSE decoder would normally examine a minimum of
$5K$ subsequences before making a decision, we will use a total of 16 stages. In these
simulations, the first and last stage are fixed since we assume that we have prior
knowledge or a decision about the first stage and zero knowledge about the last stage.
The transmitted codeword is assumed to be all zeros.

The  simulation  program  reads  the  received  sequence  from  standard  input  and  uses 
it  to  define  the  interconnection  matrix  W  according  to  equation  4.  A  relaxation 
subroutine  is  then  called  to  simulate  the  performance  of  the  network  according  to  an 
Euler  discretization  of  equation  5.  Unit  time  is  then  defined  as  one  RC  time  constant  of 
the  unforced  system.  All  variables  were  defined  to  be  single  precision  (32  bit)  floating 
point  numbers. 
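The procedure described above can be sketched in a few dozen lines. This is not the authors' program: the state numbering, the treatment of the fixed boundary stages as fixed outputs, the zero initial activities of the interior stages, and all function names are assumptions made for this sketch. The generator sequences and parameter values are those quoted in this report, and the update uses the expanded form of equation 5, $-A u + (B + C)(f(u) + I) - (u + C)(X + J)$.

```python
import math

GENERATORS = [(1, 1, 1), (1, 1, 0), (0, 1, 1)]  # generator sequences quoted in the report
N_STATES = 4                                    # 2**(K-1) states for K = 3


def f(u, lam):
    """Sigmoid nonlinearity with gain lam."""
    return 1.0 / (1.0 + math.exp(-lam * u))


def edge_bits(j, k):
    """Encoder output for the transition state j -> state k, or None if no edge.

    A state packs the two previous input bits, most recent in the top bit.
    """
    if k & 1 != j >> 1:
        return None
    window = [k >> 1, j >> 1, j & 1]            # new input bit, then the two older bits
    return [sum(x * c for x, c in zip(window, g)) % 2 for g in GENERATORS]


def build_weights(received):
    """m[i][j][k]: fraction of edge bits (stage i, j) -> (stage i+1, k) matching r_i."""
    m = []
    for r in received:
        layer = [[0.0] * N_STATES for _ in range(N_STATES)]
        for j in range(N_STATES):
            for k in range(N_STATES):
                bits = edge_bits(j, k)
                if bits is not None:
                    layer[j][k] = sum(x == y for x, y in zip(bits, r)) / len(r)
        m.append(layer)
    return m


def relax(received, first, last, A=1.0, B=1.0, C=0.75, lam=10.0, dt=0.02, steps=100):
    """Euler relaxation of the interior stages; first and last stage outputs stay fixed."""
    m = build_weights(received)
    S = len(received) + 1
    u = [[0.0] * N_STATES for _ in range(S)]
    for _ in range(steps):
        V = [[f(x, lam) for x in row] for row in u]
        V[0], V[S - 1] = list(first), list(last)
        new_u = [row[:] for row in u]
        for i in range(1, S - 1):
            I = [
                sum(m[i - 1][k][j] * V[i - 1][k] + m[i][j][k] * V[i + 1][k]
                    for k in range(N_STATES))
                for j in range(N_STATES)
            ]
            X, J = sum(V[i]), sum(I)
            for j in range(N_STATES):
                du = -A * u[i][j] + (B + C) * (V[i][j] + I[j]) - (u[i][j] + C) * (X + J)
                new_u[i][j] = u[i][j] + dt * du
        u = new_u
    V = [[f(x, lam) for x in row] for row in u]
    V[0], V[S - 1] = list(first), list(last)
    return V


received = [(0, 0, 0)] * 15                     # noiseless all-zero codeword, 16 stages
V = relax(received, first=[1, 0, 0, 0], last=[0.2] * 4)
```

With `dt = 0.02`, the default of 100 steps corresponds to two unit time intervals. All stages are updated from the same snapshot of outputs, which is one reasonable reading of a synchronous Euler discretization.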

Figure 2a shows the evolution of the network over two unit time intervals with the
sampling time $T = 0.02$ when the received codeword contains no noise. To interpret
the figure, recall that there are 16 stages of 4 neurons each. The output of each stage
is a vertical set of 4 curves. The upper-left set is the output of the first stage; the
upper-most curve is the output of the first neuron in the stage. For the first stage,
the first neuron has a fixed output of 1 and the other neurons have a fixed output of
0. The outputs of the neurons in the last stage are fixed at an intermediate value to
represent zero a priori knowledge about these states. Notice that the network reaches
an equilibrium point in which only the top neuron in each stage (representing the “00”
node in figure 1) is on and all others are off. This case illustrates that the network
can correctly decode an unerrored input and that it does so rapidly, i.e., in about one
time constant. In this case, with no errors in the input, the network performs the
same function as Hopfield and Tank’s network and does so quite well. Although we
have not been able to prove it analytically, all our simulations support the conjecture
that if $x_{i,j}(0) = 5$ for all $i$ and $j$ then the network will always converge to the global
minimum.

Figure 2: Evolution of the trellis network for (a) unerrored input, (b) input with burst
errors: R is 000 000 000 000 000 000 000 000 111 000 000 000 000 000 000. $\lambda = 10$,
$A = 1.0$, $B = 1.0$, $C = 0.75$, $T = 0.02$. The initial conditions are $x_{1,1} = 1$, $x_{1,j} = 0.0$,
$x_{16,j} = 0.2$, all other $x_{i,j} = 0.0$.

One  of  the  more  difficult  decoding  problems  for  this  network  is  the  correction  of 
a  burst  of  errors  in  a  transition  subsequence.  Figure  2b  shows  the  evolution  of  the 
network  when  three  errors  occur  in  the  transition  between  stages  9  and  10.  Note  that 
10  unit  time  intervals  are  shown  since  complete  convergence  takes  much  longer  than 
in  the  first  example.  However,  the  network  has  correctly  decoded  many  of  the  stages 
far  from  the  burst  error  in  a  much  shorter  time. 

If the received codeword contains scattered errors, the convolutional decoder should
be  able  to  correct  more  than  3  errors.  Such  a  case  is  shown  in  figure  3a  in  which  the 
received  codeword  contains  7  errors.  The  system  takes  longest  to  converge  around  two 
transitions,  5-6  and  11-12.  The  first  is  in  the  midst  of  consecutive  subsequences  which 
each  have  one  bit  errors  and  the  second  transition  contains  two  errors. 

To  illustrate  that  the  energy  function  shown  in  equation  13  is  a  good  candidate 
for  a  Lyapunov  function  for  this  network,  it  is  plotted  in  figure  3b  for  the  three  cases 
described  above.  The  nonlinearity  used  in  these  simulations  has  a  gain  of  ten,  and,  as 
predicted by the large gain limit, the energy decreases monotonically.

To  more  thoroughly  explore  the  behavior  of  the  network,  the  simulation  program 
was  modified  to  test  many  possible  error  patterns.  For  one  and  two  errors,  the  program 
exhaustively  tested  each  possible  error  pattern.  For  three  or  more  errors,  the  errors 
were  generated  randomly.  For  four  or  more  errors,  only  those  errored  sequences  for 
which  the  MLS  estimate  was  the  sequence  of  all  zeros  were  tested.  The  results  of 
this simulation are summarized in the column labeled “two-nearest” in figure 4. The
performance of the network is optimum if no more than 3 errors are present in the
received sequence; however, for four or more errors, the network fails to correctly decode
some sequences that the MLSE decoder can correctly decode.


Figure  3:  (a)  Evolution  of  the  trellis  network  for  input  with  distributed  errors.  The 
input,  R,  is  000  010  010  010  100  001  000  000  000  000  110  000  000  000  000.  The 
constants  and  initial  conditions  are  the  same  as  in  figure  2.  (b)  The  energy  function 
defined in equation 13 evaluated for the three simulations discussed.
Figure 4: Simulation results for a deconvolution network for a $K = 3$, rate 1/3 code.
The network parameters were: $\lambda = 15$, $A = 6$, $B = 1$, $C = 0.45$, and $T = 0.025$.


For  locally  interconnected  networks,  the  major  concern  is  the  flow  of  information 
through  the  network.  In  the  simulations  presented  until  now,  the  neurons  in  each  stage 
are  connected  only  to  neurons  in  neighboring  stages.  A  modified  form  of  the  network 
was  also  simulated  in  which  the  neurons  in  each  stage  are  connected  to  the  neurons 
in  the  four  nearest  neighboring  stages.  To  implement  this  network,  the  subroutine  to 
initialize  the  connection  weights  was  modified  to  assign  a  non-zero  value  to 
This  is  straight-forward  since,  for  a  code  with  a  constraint  length  of  three,  there  is  a 
single  path  connecting  two  nodes  a  distance  two  apart. 

The results of this simulation are shown in the column labeled “four-nearest” in
figure 4. It is easy to see that the network with the extra connections performs better
than the previous network. Most of the errors made by the nearest neighbor network
occur for inputs in which the received subsequences $r_i$ and $r_{i+1}$ or $r_{i+2}$ contain a total
of four or more errors. It appears that the network with the additional connections
is, in effect, able to communicate around subsequences containing errors that block
communications for the two-nearest neighbor network.

7  Summary  and  Conclusions 

We have presented a locally interconnected network which minimizes a function that
is analogous to the log likelihood function near the global minimum. The results of
simulations demonstrate that the network can successfully decode input sequences
containing no noise at least as well as the globally connected Hopfield-Tank [6]
decomposition network. Simulations also strongly support the conjecture that in the
noiseless case, the network can be guaranteed to converge to the global minimum. In
addition, for low error rates, the network can also decode noisy received sequences.

We have been able to apply the Cohen-Grossberg proof of the stability of “on-center
off-surround” networks to show that each stage will maximize the desired local
“likelihood” for noisy received sequences. We have also shown that, in the large gain
limit, the network as a whole is stable and that the equilibrium points correspond to
the MLSE decoder output. Simulations have verified this proof of stability even for
relatively small gains. Unfortunately, a proof of strict Lyapunov stability is very difficult,
and may not be possible, because of the cooperative connections in the network.

This network demonstrates that it is possible to perform interesting functions even
if only localized connections are allowed, although there may be some loss of performance.
If we view the network as an associative memory, a trellis structured network
that  contains  NS  neurons  can  correctly  recall  2^  memories.  Simulations  of  trellis  net¬ 
works  strongly  suggest  that  it  is  possible  to  guarantee  a  non-zero  minimum  radius  of 
attraction  for  all  memories.  We  are  currently  investigating  the  use  of  trellis  structured 
layers  in  multilayer  networks  to  explicitly  provide  the  networks  with  the  ability  to 
tolerate  errors  and  replace  faulty  neurons. 
Abstract

We have developed a neural or connectionist network which
optimizes a function having the form of the log likelihood
function for the output sequence of a binary symmetric channel
whose input comes from a convolutional code. The network
may be applied to more general FIR deconvolution
problems. It requires mainly localized connections and is
intended to represent the type of networks which lend themselves
to VLSI implementation. Analytical and empirical
results on network performance and stability are described.


1  Introduction 

One of the often cited problems in trying to implement
neural networks, particularly of the Hopfield type [1],
in VLSI is that the networks generally require global
interconnections. This causes difficulties, requiring long
wires to connect elements that are far apart. It would
be much easier to design efficient VLSI implementations
if connections could be restricted to be between only
elements that are “close” together.

While it is possible to impose a locality requirement
on a network which might ordinarily have global interconnections,
the performance will suffer. For a variant
of Hopfield associative memory networks, it has
been shown [2] that capacity decreases in proportion
to the maximum allowable distance between connected
elements. It would be desirable to design networks in
such a way that the locality constraint is initially satisfied,
preferably exploiting the underlying structure of
the problem.

We have developed a network which optimizes a function
that has the form of the likelihood function for
decoding convolutional codes or more general FIR signal
deconvolution. The structure of the network reflects
the structure of the trellis representation of the convolutional
code and therefore has the desired locality property.
The locality of the final network is also enhanced
by the choice of neural model used for each element.


'Work  supported  by  the  Office  of  Naval  Research  through  grant 
N00014-83-K-0t77  and  by  the  National  Science  Foundation 
through  grant  ECS84-0S460. 


Figure 1: Part of the trellis-code representation for a
rate 1/3, $K = 3$ convolutional code.

2  The  Network 

Consider the trellis graph, shown in figure 1, for a convolutional
code that is to be searched by a maximum
likelihood estimator such as the Viterbi decoder [3]. For
a rate 1/n convolutional code with constraint length
$K$, if we force a decision after $5K$ stages, the trellis
graph contains $5K$ stages, each containing $2^{K-1}$ states.
The Viterbi algorithm (or any other MLE algorithm)
must choose the path through the trellis that has the
maximum likelihood of being correct given a (possibly
noisy) received bit sequence. Assuming a binary symmetric
channel, we can assign a weight to each edge
in the trellis graph that is proportional to the number
of matching bits in the received bit sequence and the
expected sequence for that edge. The maximum likelihood
estimate in this case is equivalent to the path with
the greatest cumulative weight.

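Figure 1 corresponds to a rate 1/3 time invariant convolutional
code with a constraint length of 3 where the
generator sequences are

$$g_0 = (1\ 1\ 1), \quad g_1 = (1\ 1\ 0), \quad g_2 = (0\ 1\ 1).$$

For an input sequence $u = (u_1, \ldots, u_b, 0, 0)$, the encoder
output is $v = (v_1, \ldots, v_{b+2})$. The output after the $i$th
input bit has entered the encoder is $v_i = (v_{i,1}, v_{i,2}, v_{i,3})$
where (using modulo-2 addition)

$$v_{i,j} = \sum_{k=i-2}^{i} u_k\, g_{j-1,\,i-k}, \qquad j = 1, 2, 3,$$

taking $u_k = 0$ for $k < 1$ and writing $g_{j,m}$ for element $m$ ($m = 0, 1, 2$) of $g_j$.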

Notice that this code, for a fixed length input containing
$b$ bits, converts the $b$-bit input words into a
codeword of length $3(b + 2)$ where the 2 is added by
introducing two zeros at the end of every input to “zero-out”
the code. Equivalently, the coder can be viewed
as embedding $2^b$ memories into a $2^{3(b+2)}$-dimensional
space. The minimum distance between valid memories
or codewords in this space is the free distance of the
code, which in this example is 7. This implies that the
code is able to correct a minimum of three errors in
the received signal.
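The encoding rule can be sketched as a short shift-register loop. The function name and the `flush` flag are assumptions for this illustration; the generator sequences are entered as printed above. The example encodes a single 1 followed by zeros, which reads out the generator sequences column by column.

```python
def conv_encode(bits, generators, flush=True):
    """Encode `bits` with a rate 1/n feedforward convolutional code (modulo-2 sums)."""
    K = len(generators[0])                       # constraint length
    padded = list(bits) + [0] * (K - 1 if flush else 0)
    history = [0] * (K - 1)                      # previous inputs, most recent first
    out = []
    for b in padded:
        window = [b] + history
        for g in generators:
            out.append(sum(x * c for x, c in zip(window, g)) % 2)
        history = window[:K - 1]
    return out


GENERATORS = [(1, 1, 1), (1, 1, 0), (0, 1, 1)]   # as printed in the text
codeword = conv_encode([1, 0, 0, 0], GENERATORS)
```

For a 4-bit input the codeword has 3(4 + 2) = 18 bits, in line with the length formula in the paragraph above.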

A MLE decoder is designed to accept as input a possibly
noisy coded sequence, $r$, and produce as output
the maximum likelihood estimate, $\hat{v}$, of the original sequence,
$v$. If the set of possible $3(b + 2)$-bit encoder
output vectors is $\{x_m : m = 1, \ldots, 2^{b}\}$ and $x_{m,i}$ is
the $i$th $n$-bit subsequence of $x_m$, then

$$\hat{v} = \arg\max_{m} \prod_{i=1}^{N} p(r_i \mid x_{m,i})$$

In the case of a binary symmetric channel, this is equivalent
to

$$\hat{v} = \arg\max_{m} \sum_{i=1}^{N} h(r_i, x_{m,i}) \qquad (1)$$

where $h(a, b)$ is the number of matching bits in $a$ and $b$.

We have defined a neural network which corresponds
to the trellis used for the MLE estimation defined above.
Each stage of the trellis, representing the set of possible
states at each time instant at position $i$, is implemented
as an “on-center off-surround” competitive network [4].
This network will be described in more detail in the
next section but for now it suffices to know that it will
produce a contrast enhanced version of the input.

The edges in the trellis graph correspond to cooperative
connections between neurons. In addition to these
cooperative connections, it is sometimes helpful to add
inhibitory connections between unconnected nodes in
the trellis graph, since these transitions cannot occur
in the final path. All connections in this model are assumed
to be symmetric. The weights assigned to each
connection in this model vary with each problem instance;
i.e., for each received sequence or subsequence
the weights may be different.

More precisely, for a rate 1/n code, if there is an
edge between nodes $M_{i,j}$ and $M_{i+1,k}$ in the graph, and if
$(x_{i,j,k,1}, \ldots, x_{i,j,k,n})$ are the encoder output bits for the
transition between these two nodes and $(r_{i,1}, \ldots, r_{i,n})$
are the received bits for this transition, then there is
a symmetric cooperative connection between these two
neurons and the associated weight is

$$m_{i,j,k} = \frac{1}{n} \sum_{l=1}^{n} I_{r_{i,l}}(x_{i,j,k,l}) \qquad (2)$$

If there is no edge between $M_{i,j}$ and $M_{i+1,k}$, then
$m_{i,j,k} = 0$. $I_a(b)$ is the indicator function:
$I_a(b) = 1$ if $a = b$ and $0$ otherwise.

For more general sequence estimation or FIR signal
deconvolution, e.g., MLE sequence estimation in
the presence of intersymbol interference, the connection
weights would be determined by a similar function. If
$x_i$ and $r_i$ represent the encoded and received signals,
then the connection weight can be any monotonically
decreasing function of the distance between the two signals.
For a system in which there is no noise corrupting
the received signal, analysis of the network is simplified
by choosing this function to be a step function which is
1 if $x_i = r_i$ and 0 otherwise.

The varying input weights required by this network
can be implemented in at least two ways. To obtain
a model in the neural network spirit, the received bits
would be applied to input neurons whose output is proportional
to the degree of match between the expected
and actual inputs. The output of these neurons would
then modify the signals on the cooperative connections
at multiplicative synapses. Such synapses have been
observed in biological systems. This method has the
advantage of requiring relatively simple neurons for the
trellis since their input weights would be fixed. Alternatively,
the weights used by each neuron to calculate
its output can change with each input. This method is
probably the easiest to use for digital implementations
and is the one used in our simulations. Observe that in
either of these cases, the information required to calculate
the weight is local at each edge of the trellis graph
and therefore at each connection in the network.

Intuitively, it is easiest to understand the action of
the entire network by examining one stage. Consider
the nodes in stage $i$ of the trellis graph and assume
that the conditional probabilities of the nodes in stages
$i - 1$ and $i + 1$ are known. Then the conditional probability
of each node in stage $i$ is simply the sum of the
probabilities of each node in $i - 1$ and $i + 1$ weighted
by the transition probabilities. If we look at stage $i$ in
the network, and again let the neighboring stages $i - 1$
and $i + 1$ be fixed with the output of each neuron corresponding
to the “likelihood” of the corresponding state
at that stage, then the final outputs of the neurons $M_{i,j}$
will correspond to the “likelihood” of each of the corresponding
states. When the stage reaches equilibrium,
the neuron corresponding to the most likely state will
have the largest output.


3  The  Neural  Model 


In the previous section, we defined the problem to be
solved by this network and the connections to be used.
These requirements place some restrictions on the neural
models that can be used. The model used in this
network, called an “on-center off-surround” network because
the output of each neuron in the network is used
as positive feedback to itself and negative feedback to
all the other neurons in the network, was proposed by
Grossberg [4]. The model allows the output of each
neuron to take on a range of values and was designed
to support contrast enhancement and competition. The
model also guarantees that the final output of each neuron
is a function of the relative intensity of its input as
a fraction of the total input provided to the network.

The instantaneous activity, $u_i$, of each neuron $M_i$
($i = 1, \ldots, N$) in the “on-center off-surround” network
is described by a differential equation:

$$\dot{u}_i = -A_i u_i + (B_i - C_i u_i)\big(I_i + f_i(u_i)\big) - (D_i u_i + E_i) \sum_{k=1}^{N} F_{i,k}\, g_k(u_k) \qquad (3)$$

Here $A_i$, $B_i$, $C_i$, $D_i$, and $E_i$ are constants; $f_i(\cdot)$ and
$g_k(\cdot)$ are nonlinear non-decreasing functions; and $I_i$ is
an external input to $M_i$. $F_{i,k}$ is the weight associated
with the input to $M_i$ from $M_k$. It can be shown that this
system restricts $u_i$ in such a way that

$$-\frac{E_i}{D_i} \le u_i \le \frac{B_i}{C_i}.$$

For our deconvolution network, it is not really possible
to use equation 3 directly since it assumes that
the external inputs $I_i$ are constant for at least the time
it takes the network to converge. To write an equation
that is similar to equation 3, however, we define
an external input to a neuron in stage $i$ to be any input
that does not originate from some neuron in stage $i$
and drop the requirement that the inputs be constant.
For simplicity, we also define all the constants to be the
same for each neuron and take all the nonlinearities to
be equal to the same sigmoid function (spatial homogeneity).
Specifically, for the simulations presented in
section 5,

$$f_i(x) = g_i(x) = f(x) = (1 + e^{-\lambda x})^{-1} \qquad (4)$$

Following Hopfield’s notation [5], $\lambda$ represents the gain
of the nonlinearity.

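Using the $m_{i,j,k}$ defined in equation 2, the differential
equation that governs the instantaneous activity of the
neurons in a deconvolution network with $S$ stages and
$N$ states in each stage can be written as

$$\begin{aligned}
\dot{u}_{i,j} = {} & -A u_{i,j} + (B - u_{i,j}) \Big( f(u_{i,j}) + \sum_{k=1}^{N} \big[ m_{i-1,k,j} f(u_{i-1,k}) + m_{i,j,k} f(u_{i+1,k}) \big] \Big) \\
& - (C + u_{i,j}) \sum_{k \neq j}^{N} \Big( f(u_{i,k}) + \sum_{l=1}^{N} \big[ m_{i-1,l,k} f(u_{i-1,l}) + m_{i,k,l} f(u_{i+1,l}) \big] \Big) \qquad (5)
\end{aligned}$$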


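Equation 5 can be rewritten more compactly as

$$\dot{u}_{i,j} = -A u_{i,j} - \sum_{k=1}^{S} \sum_{l=1}^{N} \big( u_{i,j} S_{i,j,k,l} - T_{i,j,k,l} \big) f(u_{k,l}) \qquad (6)$$

where, for $1 \le l \le N$,

$$\begin{aligned}
S_{i,j,i,l} &= 1 \\
S_{i,j,i-1,l} &= \sum_{k=1}^{N} m_{i-1,l,k} \\
S_{i,j,i+1,l} &= \sum_{k=1}^{N} m_{i,k,l} \\
T_{i,j,i,j} &= B \\
T_{i,j,i,l} &= -C \quad \forall\, l \neq j \\
T_{i,j,i-1,l} &= (B + C)\, m_{i-1,l,j} - C \sum_{k=1}^{N} m_{i-1,l,k} \\
T_{i,j,i+1,l} &= (B + C)\, m_{i,j,l} - C \sum_{k=1}^{N} m_{i,k,l}
\end{aligned} \qquad (7)$$

If $k \notin \{i - 1, i, i + 1\}$, then $S_{i,j,k,l} = T_{i,j,k,l} = 0$.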

To eliminate the need for global interconnections
within a stage, we can add summing elements to calculate

$$X_i = \sum_{j=1}^{N} f(u_{i,j}) \quad \text{and} \quad J_i = \sum_{j=1}^{N} I_{i,j}$$

where

$$I_{i,j} = \sum_{k=1}^{N} \big[ m_{i-1,k,j} f(u_{i-1,k}) + m_{i,j,k} f(u_{i+1,k}) \big] \qquad (8)$$

Then equation 5 can be rewritten as

$$\dot{u}_{i,j} = -A u_{i,j} + (B + C)\big( f(u_{i,j}) + I_{i,j} \big) - (u_{i,j} + C)(X_i + J_i)$$

4  Stability  of  the  Network 

At the end of section 2, the desired operation of a single stage given that the neighboring stages are fixed is described. It is possible to show that in this situation a single stage is stable. To do this, fix $f(u_{k,l})$ for $k \in \{i-1, i+1\}$ so that equation 6 can be written in the same form as equation 3:

$$\dot u_{i,j} = -A u_{i,j} + (B - u_{i,j})\big(I_{i,j} + f(u_{i,j})\big) - (C + u_{i,j}) \sum_{k \ne j} \big(I_{i,k} + f(u_{i,k})\big) \tag{9}$$

where $I_{i,j}$ is defined in equation 8.

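Equations 9 and 3 are special cases of the more general nonlinear system

$$\dot x_i = a_i(x_i) \Big( b_i(x_i) - \sum_{k=1}^{n} c_{i,k}\, d_k(x_k) \Big)$$

where

$a_i(x_i)$ is continuous and $a_i(x_i) > 0$ for $x_i > 0$
$b_i(x_i)$ is continuous for $x_i > 0$
$c_{i,k} = c_{k,i}$
$d_i'(x_i) \ge 0 \quad \forall\, x_i \in (-\infty, \infty)$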

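It has been shown [6] that a system that can be written in this form has a global Lyapunov function which can be written

$$V(x) = -\sum_{i=1}^{n} \int_0^{x_i} b_i(\xi)\, d_i'(\xi)\, d\xi + \frac{1}{2} \sum_{j=1}^{n} \sum_{k=1}^{n} c_{j,k}\, d_j(x_j)\, d_k(x_k) \tag{10}$$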


and that, therefore, such a system is asymptotically stable. In our case, this means that a single stage has the desired behavior when the neighboring stages are fixed.
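The single-stage behavior can be illustrated with a minimal Euler simulation of equation 9. This is only a sketch under stated assumptions: the parameter values ($\lambda = 10$, $A = 1$, $B = 1$, $C = 0.75$, $T = 0.02$) are borrowed from the caption of figure 2, the external inputs `ext` are hypothetical fixed values standing in for $I_{i,j}$, and the function names are invented for illustration.

```python
import math

def f(x, gain=10.0):
    """Logistic nonlinearity with gain (assumed form of the sigmoid)."""
    return 1.0 / (1.0 + math.exp(-gain * x))

def step_stage(u, ext, A=1.0, B=1.0, C=0.75, T=0.02):
    """One Euler step of equation 9 for one stage with fixed external inputs."""
    drive = [ext[k] + f(u[k]) for k in range(len(u))]  # I_k + f(u_k)
    total = sum(drive)
    new_u = []
    for j, uj in enumerate(u):
        others = total - drive[j]  # sum over k != j
        du = -A * uj + (B - uj) * drive[j] - (C + uj) * others
        new_u.append(uj + T * du)
    return new_u

def run_stage(u0, ext, steps=500):
    """Iterate the Euler step and keep the whole trajectory."""
    u = list(u0)
    history = [u]
    for _ in range(steps):
        u = step_stage(u, ext)
        history.append(u)
    return history

# Hypothetical inputs: neuron 0 receives the strongest external support.
history = run_stage([0.0, 0.0, 0.0, 0.0], ext=[0.9, 0.1, 0.1, 0.1])
print([round(x, 3) for x in history[-1]])
```

With these step sizes the activities stay inside the interval $[-C, B]$, and the neuron with the largest external input keeps the largest activity, which is the on-center off-surround behavior the stability argument relies on.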

It does not seem possible to use the Cohen-Grossberg stability proof for the entire system in equation 5. Extensions of 3 and 10 also seem to fall short. In fact, Cohen and Grossberg note that networks which allow cooperative interactions define systems for which no stability proof exists [6].

In another approach, Hopfield [5] showed that a network with simpler feedback, governed by

1 

^  j=i 

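has a Lyapunov function of the form (for $V_i = f(u_i)$)

$$E = -\frac{1}{2} \sum_i \sum_j T_{i,j} V_i V_j + \sum_i \int_0^{V_i} f^{-1}(v)\, dv \tag{11}$$

Hopfield argued that the nonlinearity can be normalized so that we can write

$$f^{-1}(V_i) = \frac{1}{\lambda} \bar f^{-1}(V_i)$$

where the bar denotes the normalized function. Then the integral in equation 11 can be written

$$\int_0^{V_i} f^{-1}(V)\, dV = \frac{1}{\lambda} \int_0^{V_i} \bar f^{-1}(V)\, dV$$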

Hopfield used this manipulation to argue that for large gains ($\lambda \to \infty$), the second term in equation 11 is negligible and so the network of analog neurons has the same equilibrium points as a network of discontinuous on-off neurons.


A  possible  extension  of  equation  11  for  the  deconvo¬ 
lution  network  is 


^  =  -  5  E  Ti,j.k,tVi,iVkj 


-  E  “  E  Sij.k.,Vk,,)  P  '  r^(v)dv 

i.3  ^  k.i  ^  •'i 


The  time  derivative  of  E  is 
E 


(12) 


«.y  '  *,/ 

+  E  -  T  E  r  '  J-HV)dv) 

k.t  k,l  •'i  ' 


It can be shown that for $f(x) = (1 + e^{-\lambda x})^{-1}$,


(13) 


I~^(V)dV  =  B  <00 

In this deconvolution network, $T_{i,j,k,l} = 0$ for $|i - k| > 1$ or $|j - l| > S$, so there are no more than $3S$ terms in the summation. Then, in the limit as $\lambda \to \infty$, the term in parentheses in equation 13 converges to $\dot u_{i,j}$ in equation 6. Using the chain rule, we can write


liin  E  = 

A-*oo 


It can also be shown that

$$\left(f^{-1}\right)'(V_i) > 0$$

for all $V_i$, and this implies that

$$\lim_{\lambda \to \infty} \dot E \le 0 \quad \forall\, u.$$


Therefore,  for  large  gains,  E  as  defined  in  equation  12  is 
a  Lyapunov  function  for  the  system  described  by  equa¬ 
tion  5. 

We can apply the same asymptotic argument to the energy function in equation 12 since the term on the second line of the equation is also scaled by $\lambda$. This implies that the equilibrium points in this network in the large gain limit also correspond to the equilibrium points of a network of discontinuous on-off neurons. For the binary neuron case, it is fairly straightforward to show that the energy function has minima at the desired decoder outputs if we assume that only one neuron in each stage may be on or that $B$ and $C$ are appropriately chosen to favor this. This bound is, however, not as tight as that on the derivative of the energy function since there are $O(S^2 N)$ terms in the summation rather than $O(S)$ as above.




Figure 2: Evolution of the trellis network for unerrored input. $\lambda = 10$, $A = 1.0$, $B = 1.0$, $C = 0.75$, $T = 0.02$, input is all zeros. The initial conditions are $x_{1,1} = 1.0$, $x_{1,j} = 0.0$, $x_{16,j} = 0.2$, all other $x_{i,j} = 0.0$.

5  Simulation  Results 

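This network was simulated by discretizing equation 5 using Euler's method. For a sampling frequency of $1/T$, the equation of the updated activity, $u_{i,j}(t+1)$, is

$$\begin{aligned}
u_{i,j}(t+1) = {} & u_{i,j}(t) - T A\, u_{i,j}(t) \\
& + T\,\big(B - u_{i,j}(t)\big) \Big( f(u_{i,j}(t)) + \sum_{k=1}^{N} \big[ m_{i-1,k,j} f(u_{i-1,k}(t)) + m_{i,j,k} f(u_{i+1,k}(t)) \big] \Big) \\
& - T\,\big(C + u_{i,j}(t)\big) \sum_{k \ne j} \Big( f(u_{i,k}(t)) + \sum_{l=1}^{N} \big[ m_{i-1,l,k} f(u_{i-1,l}(t)) + m_{i,k,l} f(u_{i+1,l}(t)) \big] \Big)
\end{aligned}$$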

The  simulations  presented  here  are  for  the  convolu¬ 
tional  code  illustrated  in  figure  1.  Since  this  code  has 
a  constraint  length  of  3,  there  are  4  possible  states  in 
each  stage  and  we  will  use  a  total  of  16  stages.  The  first 
and  last  stages  are  fixed  since  we  assume  that  we  have 
prior  knowledge  or  a  decision  about  the  first  stage  and 
zero knowledge about the last stage. The transmitted
codeword is assumed to be all zeros.

Figure  2  shows  the  evolution  of  the  network  over  2 
unit  time  intervals  with  T  =  0.02  when  the  received 
codeword  contains  no  noise.  The  output  of  each  stage 
is  a  vertical  set  of  4  curves.  The  upper-left  set  is  the 
output  of  the  first  stage;  the  upper-most  curve  is  the 
output  of  the  first  neuron  in  the  stage.  For  the  first 


Figure  3:  Evolution  of  the  trellis  network  for  input  with 
burst  errors.  The  input  is  000  000  000  000  000  000  000 
000  111  000  000  000  000  000  000.  The  constants  and 
initial  conditions  are  the  same  as  in  figure  2. 

stage, the first neuron has a fixed output of 1 and the other neurons have a fixed output of 0. The outputs of the neurons in the last stages are fixed at an intermediate value to represent zero a priori knowledge about these states. Notice that the network reaches an equilibrium point in which only the top neurons in each stage (representing the "00" node in figure 1) are on and all others are off. This case simply illustrates that the network can correctly decode an unerrored input and that it does so rapidly, i.e., in about one time constant.

One of the more difficult decoding problems for this
network  is  the  correction  of  a  burst  of  errors  in  a  tran¬ 
sition  subsequence.  Figure  3  shows  the  evolution  of  the 
network  when  three  errors  occur  in  the  transition  be¬ 
tween  stages  9  and  10.  Note  that  10  unit  time  inter¬ 
vals  are  shown  since  complete  convergence  takes  much 
longer  than  in  the  first  example.  However,  the  network 
has  correctly  decoded  many  of  the  stages  far  from  the 
burst  error  in  a  much  shorter  time. 

If  the  received  codeword  contains  scattered  errors,  the 
convolutional  decoder  should  be  able  to  correct  more 
than  3  errors.  Such  a  case  is  shown  in  figure  4  in  which 
the  received  codeword  contains  7  errors.  The  system 
takes  longest  to  converge  around  two  transitions,  5-6 
and  11-12.  The  first  is  in  the  midst  of  consecutive  sub¬ 
sequences  which  each  have  one  bit  errors  and  the  second 
transition  contains  two  errors. 

To  illustrate  that  the  energy  function  shown  in  equa¬ 
tion  12  is  a  good  candidate  for  a  Lyapunov  function  for 
this  network,  it  is  plotted  in  figure  5  for  the  three  cases 


Figure 4: Evolution of the trellis network for input with
distributed errors. The input is 000 010 010 010 100 001
000 000 000 000 110 000 000 000 000. The constants
and initial conditions are the same as in figure 2.

Figure 5: The energy function defined in equation 12
evaluated  for  the  networks  whose  outputs  are  shown  in 
figures  2,  3,  and  4. 


described  above.  The  nonlinearity  used  in  these  simu¬ 
lations  has  a  gain  of  ten,  and,  as  predicted  by  the  large 
gain  limit,  the  energy  decreases  monotonically. 

6 Discussion

We have presented a network which minimizes a function that is analogous to the log likelihood function. The results of several simulations demonstrate that the network can successfully decode noisy input vectors. Other simulations with different codes have supported these results.

Section 4 illustrates that a proof of the stability of "on-center off-surround" networks can be applied to show that each stage will maximize the desired local "likelihood". The same section also showed that, in the large gain limit, the network as a whole is stable and the equilibrium points correspond to the MLE decoder output.

Our network is distinguished by the fact that connections are localized. Although each stage has been assumed to be globally interconnected, with the addition of simple summing or averaging elements this is not necessary. Each element would receive the sum of the outputs of that stage rather than each individual value. Connections between elements in different stages can occur only if the stages are adjacent. Given the structure of a rate 1/n code, no element can be connected to more than 2 elements in each of the two neighboring stages.

Abstract: Associative memory networks, consisting of highly interconnected binary-valued cells, have been used to model neural networks. Tight asymptotic bounds have been found for the information capacity of these networks. We derive the asymptotic information capacity of these networks using results from normal approximation theory and theorems about exchangeable random variables.

I.  Introduction 

For  many  years  researchers  in  various  disciplines 
have  studied  models  for  the  brain.  Many  models  have 
been  developed  in  attempts  to  understand  how  neural 
networks  function.  One  class  of  such  models  is  based  on 
the concept of associative memory [1]-[27]. Associative
memories are composed of a collection of interconnected
elements having data storage capabilities. The elements are
accessed in parallel by a data probe vector rather than by a
set of specific addresses [14].

Recent  years  have  seen  interest  increasing  in  the  model¬ 
ing  of  neural  networks  for  possible  applications  to  com¬ 
puter  architectures.  Associative  memory  network  (AMN) 
models of one particular form, consisting of highly interconnected
threshold devices [1], [2], [5]-[12], [24]-[27] have
received much attention. These models are sometimes referred
to as binary associative memory networks (BAMNs).

This  paper  discusses  some  analytical  aspects  of  the 
BAMN  models.  Specifically,  we  analyze  the  storage  capa¬ 
bilities  of  these  models.  We  consider  the  case  where  cells 
can take on only one of two values {-1,1}. Our work is
motivated  by  a  desire  to  understand  better  the  results  of 
[12],  where  various  elaborate  arguments  are  used  to  find 
the  asymptotic  value  of  the  network  storage  capacity.  In 
contrast,  we  determine  the  asymptotic  network  storage 
capacity  by  applying  normal  approximation  theory  and 
theorems  about  exchangeable  random  variables.  This  new 
approach  contributes  to  a  better  understanding  of  the 
results  and  provides  a  means  of  extending  the  analysis  to 
more  general  AMN  models. 

In Section II the standard BAMN model and the notion
of  capacity  of  the  network  are  described.  The  operation 
and  the  construction  of  the  standard  model  are  discussed. 
In  this  model  each  updating  operation  on  a  cell  is  per¬ 
formed  by  thresholding  a  linearly  weighted  sum  of  other 
cell  values.  If  the  threshold  is  exceeded  the  cell  takes  on  a 
1  value;  otherwise,  it  takes  on  a  -1  value.  A  network  is 
characterized  by  a  matrix  of  weights  that  determines  the 
strengths  of  the  interconnections  between  different  cells. 
The  weight  matrix  is  constructed  from  a  sum  of  outer 
products  of  vectors  chosen  to  be  the  desired  “codewords” 
to  be  stored  by  the  network. 

The  information  capacity  of  the  standard  model  is  de¬ 
rived  in  Section  III.  This  requires  a  formalization  of  the 
notions  of  capacity  and  stability.  Following  [12]  we  con¬ 
sider  two  different  definitions.  In  the  first,  capacity  is 
related  to  the  maximum  number  of  codewords  that  can  be 
used  to  construct  the  AMN  while  maintaining  a  fixed 
codeword  as  a  stable  vector.  In  the  second  definition,  the 
capacity  is  related  to  the  largest  number  of  codewords  that 
can  all  be  stored  as  stable  vectors  in  the  network.  We  also 
consider  the  radius  of  attraction  of  each  of  these  code¬ 
words.  For  example,  if  the  state  of  the  AMN  after  a  few 
update  operations  converges  to  a  given  vector  for  all  initial 
probe  vectors  at  a  Hamming  distance  of  K  or  less,  then  the 
given  stable  vector  has  a  radius  of  attraction  of  at  least  K. 
Proofs  of  the  various  results  are  given  in  the  Appendices. 
They  involve  normal  approximation  theory  and  theorems 
from  exchangeable  random  variables.  Finally,  in  Section 
IV  we  summarize  the  main  theoretical  results  of  this  paper 
and  introduce  extensions  for  further  research. 

II.  Operation  and  Construction  of  the  Binary 
Associative  Memory  Network  Model 

In this section we discuss the AMN model presented in [1], [2], sometimes referred to as the binary associative memory network. A network consists of cells $\{X_i\}$, $1 \le i \le N$, with each cell taking on one of the values {-1,1}. Each cell affects all other cells through an interconnection or weight matrix $T$. The interconnection matrix is symmetric with 0 values on its diagonal. Each cell is updated at random with the update events forming a Poisson process with rate $\lambda$. At each update the linearly weighted sum of all other cell values is compared to a given threshold. If the weighted sum exceeds the threshold, the cell takes on a 1 value; if not, the cell takes on a -1 value. We assume that the updating processes of all cells are independent, so that the total number of updates is a Poisson process with rate $N\lambda$. Using a counter $k$ that is incremented every time any cell is updated, an update of cell $i$ at time $k+1$ is described by the equation


$$X_i(k+1) = \begin{cases} 1, & x > 0 \\ X_i(k), & x = 0 \\ -1, & x < 0 \end{cases} \tag{2.1}$$

where

$$x = \sum_{j \ne i} T(i,j)\, X_j(k).$$


Here we have chosen a 0 threshold.

For  an  AMN  with  interconnection  matrix  T,  we  define  a 
binary  N  vector  V  to  be  invariant  if,  when  V  is  input  into 
the  network,  all  updates  leave  the  state  of  the  network 
unchanged.  An  invariant  vector  is  also  called  a  stable 
vector  of  the  AMN.  The  set  of  all  stable  vectors  is  denoted 
by  M. 

Now consider the construction of an AMN, a process that can be viewed as learning. The $T$ matrix is constructed so that certain vectors are stored in the network. A vector is successfully stored in the network if it can be retrieved by an appropriate data probe vector. We let $V(i)$, $1 \le i \le m$, be the codewords, binary $N$ vectors, used to construct the $T$ matrix. The desired behavior of the model when some vector $\tilde V$ is input into the network, i.e., when the network is initialized at $\tilde V$, is that after a few updates, the state of the network should become $V$, a stable vector which is close to $\tilde V$ in Hamming distance. Several techniques can be used to construct the $T$ matrix of the AMN. Here we use a simple technique involving correlation, which contrasts with techniques using eigenvectors and orthogonal learning approaches shown in [14], [18], [24]; the latter are more complicated to implement. The correlation technique constructs the $T$ matrix from $\{V(i)\}$ as follows. Let

$$T_i = V(i)V(i)^T - I, \quad 1 \le i \le m \tag{2.2}$$

and then take

$$T = \sum_{i=1}^{m} T_i. \tag{2.3}$$


Hopefully, all of the chosen codewords $\{V(i)\}$ will be stable vectors of the network; however, this cannot occur when $m$ becomes too large in comparison to $N$. The set of all codewords that are stable vectors is called $M_c$. Thus $M_c \subset M$, but $M$ also contains the "one's complement" of vectors in $M_c$ and possibly other vectors which we call spurious stable vectors.

To find the capacity of these networks, random coding arguments are used; each component of each codeword is assumed to be chosen independently of all other components, with the probability of a 1 or -1 each equal to 1/2. Then, given $m$ randomly chosen codewords, one can find the probability that any codeword or that all the chosen codewords are members of $M$. Two definitions for capacity which we call $m(\epsilon)$ and $\hat m(\epsilon)$ are introduced in [12]. Before presenting these definitions, we define a synchronous update $\Delta(V)$ as a simultaneous update on all $N$ cells with $V$ initially input into the network. It is easily shown that if $V$ is a stable vector then $V = \Delta(V)$; the converse is not necessarily true. This stronger condition is used in defining $m(\epsilon)$ as

$$m(\epsilon) = \max m \ni \Pr\big(V(k) = \Delta(V(k))\big) > 1 - \epsilon \tag{2.4}$$

(where we may take $k = 1$) and $\hat m(\epsilon)$ as

$$\hat m(\epsilon) = \max m \ni \Pr\big(V(j) = \Delta(V(j)),\ 1 \le j \le m\big) > 1 - \epsilon. \tag{2.5}$$

We show in Section III that $m(\epsilon) \sim N/2\log N$ and that $\hat m(\epsilon) \sim N/4\log N$ for any $\epsilon > 0$ and for $N$ sufficiently large; these results were obtained by different, more tedious methods in [12].

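We first present a simple example to help visualize how these AMN models work. Take $N = 4$ and $m = 3$, and let

$$V(1) = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}, \quad V(2) = \begin{bmatrix} 1 \\ -1 \\ 1 \\ -1 \end{bmatrix}, \quad V(3) = \begin{bmatrix} 1 \\ 1 \\ 1 \\ -1 \end{bmatrix}.$$

Using the correlation method, $T$ is easily found to be

$$T = \begin{bmatrix} 0 & 1 & 3 & -1 \\ 1 & 0 & 1 & 1 \\ 3 & 1 & 0 & -1 \\ -1 & 1 & -1 & 0 \end{bmatrix}.$$

We note that only $V(3) \in M$ (i.e., $M_c = \{V(3)\}$). Fig. 1 shows a diagram of this model.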
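The example can be checked numerically with a short sketch. The Python below is illustrative rather than the authors' code: the function names are hypothetical, the codewords are $V(1) = (1,1,1,1)$, $V(2) = (1,-1,1,-1)$, $V(3) = (1,1,1,-1)$ as read from the example, and a weighted sum of exactly 0 is assumed to leave the cell unchanged.

```python
def correlation_matrix(codewords):
    """Sum of outer products V V^T with the diagonal forced to zero."""
    n = len(codewords[0])
    T = [[0] * n for _ in range(n)]
    for v in codewords:
        for i in range(n):
            for j in range(n):
                if i != j:
                    T[i][j] += v[i] * v[j]
    return T

def synchronous_update(T, v):
    """Update every cell at once against a 0 threshold."""
    out = []
    for i, row in enumerate(T):
        s = sum(row[j] * v[j] for j in range(len(v)) if j != i)
        if s > 0:
            out.append(1)
        elif s < 0:
            out.append(-1)
        else:
            out.append(v[i])  # assumption: a tie keeps the current value
    return out

def is_stable(T, v):
    """A vector is stable if a synchronous update leaves it unchanged."""
    return synchronous_update(T, v) == list(v)

codewords = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, 1, -1]]
T = correlation_matrix(codewords)
stable = [is_stable(T, v) for v in codewords]
print(T)
print(stable)
```

Running the sketch reproduces the $T$ matrix of the example and reports that only the third codeword is left unchanged by a synchronous update.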


Fig.  1.  Example  network  with  edges  representing  interconnection 
weights  and  nodes  representing  cells. 


Before concluding this section we note that in our analysis we always assume that any initial state will converge to a stable vector. This was justified by Hopfield in [2] by noting that

$$E = -\frac{1}{2} \sum_i \sum_j T(i,j)\, X_i(k)\, X_j(k), \quad k \ge 0 \tag{2.6}$$

is a monotonic decreasing function of the update counter $k$ and that the elements of $M$ correspond to the local minima of $E$. Therefore, any initial state will converge to a stable vector.
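The monotonicity of the energy can be illustrated with a small experiment. This is a sketch under stated assumptions: cells are updated one at a time in a random order (a stand-in for the Poisson update process), a weighted sum of exactly 0 leaves the cell unchanged, and the helper names, network size, and seed are arbitrary illustrative choices.

```python
import random

def build_T(codewords):
    """Correlation construction with zero diagonal."""
    n = len(codewords[0])
    return [[0 if i == j else sum(v[i] * v[j] for v in codewords)
             for j in range(n)] for i in range(n)]

def energy(T, x):
    """E = -1/2 * sum_i sum_j T(i,j) x_i x_j."""
    n = len(x)
    return -0.5 * sum(T[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def async_update(T, x, i):
    """Threshold update of cell i in place; a zero sum keeps the old value."""
    s = sum(T[i][j] * x[j] for j in range(len(x)) if j != i)
    if s > 0:
        x[i] = 1
    elif s < 0:
        x[i] = -1

def run(T, x0, updates, rng):
    """Apply random single-cell updates and record the energy after each one."""
    x = list(x0)
    energies = [energy(T, x)]
    for _ in range(updates):
        async_update(T, x, rng.randrange(len(x)))
        energies.append(energy(T, x))
    return x, energies

rng = random.Random(0)
N, m = 16, 3
codewords = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(m)]
T = build_T(codewords)
start = [rng.choice((-1, 1)) for _ in range(N)]
final_state, energies = run(T, start, 200, rng)
print(energies[0], energies[-1])
```

Because $T$ is symmetric with a zero diagonal, each single-cell update can only lower $E$ or leave it unchanged, so the recorded sequence never increases.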

III. Derivation of the Information Capacity
for the BAMN Model

This section studies the capacities $m(\epsilon)$ and $\hat m(\epsilon)$ using results from normal approximation theory and theorems about exchangeable random variables. Our main results are asymptotic expressions for these quantities. We discuss the two patterns for convergence of initial states to some stable vector. We also consider the error-correcting capabilities or the radius of attraction of each codeword. Much of the original discussion about capacity, convergence, and radius of attraction can be found in [12] and [24]. The key differences in our approach versus that of [12] are the proofs of the main theorems; these are given in the appendices.

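Associated with each cell value we define the interaction strength (IS) of cell $j$ for codeword $k$ as

$$\begin{aligned}
U(j,k) &= \sum_{i \ne j} T(j,i)\, V_i(k) = \sum_{i \ne j} \sum_{l=1}^{m} V_j(l)\, V_i(l)\, V_i(k) \\
&= (N-1)\, V_j(k) + \sum_{i \ne j} \sum_{l \ne k} V_i(l)\, V_j(l)\, V_i(k).
\end{aligned} \tag{3.1}$$
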

According to the standard model, when a cell is updated its next value is determined from a comparison of its IS with a threshold value. Using the random coding model described in the previous section $U(j,k)$ is a random variable. We transform this random variable by letting

$$\begin{aligned}
u(j,k) &= \sum_{i \ne j} \sum_{l \ne k} V_i(l)\, V_j(l)\, V_i(k)\, V_j(k) = \sum_{l \ne k} u_j(l) \\
&= \big[ U(j,k) - (N-1)\, V_j(k) \big] V_j(k)
\end{aligned} \tag{3.2}$$

where we call $u(j,k)$ the normalized interaction strength (NIS). The $u_j(l)$ for $l \ne k$ are random variables having probability mass function identical to a shifted binomial random variable with mean 0, $N-1$ points, and parameter $p = 1/2$. Since $V(l)$ are chosen independently for all $l$, the $u(j,k)$ are random variables having probability mass function identical to a shifted binomial random variable with mean 0, $(N-1)(m-1)$ points, and parameter $p = 1/2$.

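To evaluate $m(\epsilon)$ we need to compute

$$p = \Pr\big(V(k) = \Delta(V(k))\big) = \Pr\Big( \bigcap_{j=1}^{N} \{ u(j,k) \ge -N+1 \} \Big) \tag{3.3}$$

(where we may take $k = 1$). To evaluate $\hat m(\epsilon)$ we need to compute

$$\hat p = \Pr\big(V(k) = \Delta(V(k)),\ 1 \le k \le m\big) = \Pr\Big( \bigcap_{j=1}^{N} \bigcap_{k=1}^{m} \{ u(j,k) \ge -N+1 \} \Big). \tag{3.4}$$
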

The major stumbling block to analyzing (3.3) and (3.4) is the fact that the $u(j,k)$ are not independent. In fact, it is easily shown that

$$E[u(j,k)\, u(l,n)] = \begin{cases} (N-1)(m-1), & j = l,\ k = n \\ m-1, & j \ne l,\ k = n \\ N-1, & j = l,\ k \ne n \\ 1, & \text{otherwise.} \end{cases} \tag{3.5}$$


By the De Moivre-Laplace theorem [28], for large $N$, $u(j,k)$ converges in distribution to a Gaussian random variable with the same first- and second-order moments. Thus we let $g(j,k)$ be a Gaussian random variable with the same first- and second-order moments as $u(j,k)$ and investigate the quantities

$$p_G = \Pr\Big( \bigcap_{j=1}^{N} \{ g(j,k) > -N+1 \} \Big) \tag{3.6}$$

$$\hat p_G = \Pr\Big( \bigcap_{j=1}^{N} \bigcap_{k=1}^{m} \{ g(j,k) > -N+1 \} \Big). \tag{3.7}$$

Let $Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-t^2/2}\, dt$ be the standard normal error function and $I(\cdot)$ be the standard indicator function. In Appendix I we develop and use the theory of exchangeable random variables to show the following result.


Theorem  1:  1)  If 
N 

m  < - 

log 


E  ^(g(;.^)s-Af) 


and  Ty  is  a  Poisson  random  variable  with  parameter 
=  NQ(^N/m  ),  then  -»  in  distribution. 

2)  If 

^  N  m 

and  is  a  Poisson  random  variable  with  parameter 
Xj  =  NmQ{^N/m ),  then  -*  in  distribution. 


We can use this theorem to evaluate the capacities defined earlier. Letting $m = N/(a \log N)$ we solve for the constant $a$ for both cases. From the theorem, for large $N$ we have

$$p_G \approx e^{-NQ(\sqrt{N/m})} \tag{3.8}$$

and

$$\hat p_G \approx e^{-NmQ(\sqrt{N/m})}. \tag{3.9}$$


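By repeated integration by parts we obtain the expansion

$$Q(x) = \frac{e^{-x^2/2}}{x\sqrt{2\pi}} \left[ 1 + \sum_{i=1}^{\infty} \frac{(-1)^i \prod_{j=1}^{i} (2j-1)}{x^{2i}} \right].$$

Hence, for large $x$, we have the approximation $Q(x) \approx e^{-x^2/2} / (x\sqrt{2\pi})$. Then we have the following result.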
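The quality of the large-$x$ approximation can be checked directly. The sketch below assumes $Q$ is the upper tail of the standard normal distribution, computed here through the complementary error function in Python's standard library; the function names are illustrative.

```python
import math

def Q(x):
    """Upper tail of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def Q_approx(x):
    """Leading term of the expansion: exp(-x^2 / 2) / (x * sqrt(2 * pi))."""
    return math.exp(-x * x / 2.0) / (x * math.sqrt(2.0 * math.pi))

for x in (1.0, 2.0, 3.0, 5.0, 8.0):
    print(x, Q(x), Q_approx(x), Q_approx(x) / Q(x))
```

The ratio of the approximation to the exact tail approaches 1 from above as $x$ grows, which is why the leading term suffices when $N/m$ is large.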

Theorem 2: For case 1) in Theorem 1, assuming $p = p_G$, $p_G \gg e^{-N}$, and as $N \to \infty$,

$$m(1 - p_G) = \frac{N}{2\log N - 2\log\log\frac{1}{p_G} - \log(\log N) - \log 4\pi} \sim \frac{N}{2\log N}. \tag{3.10}$$

For case 2) in Theorem 1, assuming $\hat p = \hat p_G$, $\hat p_G \gg e^{-N}$, and as $N \to \infty$,

$$\hat m(1 - \hat p_G) = \frac{N}{4\log N - 2\log\log\frac{1}{\hat p_G} - 3\log(\log N) - \log 128\pi} \sim \frac{N}{4\log N}. \tag{3.11}$$


Next we consider conditions under which $p_G \to p$ and $\hat p_G \to \hat p$. Appendix II shows that when $N$ is sufficiently large and $m = N/(a \log N)$ for $1/2 < a \ll N/\log N$ then $p_G \to p$. This is a consequence of normal approximation theory. It can similarly be shown that $\hat p_G \to \hat p$. Therefore, for any $\epsilon > 0$ we can find an $N$ large enough such that $m(\epsilon) \sim N/2\log N$ and $\hat m(\epsilon) \sim N/4\log N$.
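To get a feel for these expressions at finite $N$, the Gaussian and Poisson approximations (3.8) and (3.9) can be evaluated directly. The sketch below is only illustrative: it assumes natural logarithms, uses the approximations as if they were exact, and the function names and the choice $\epsilon = 0.1$ are hypothetical.

```python
import math

def Q(x):
    """Upper tail of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def p_single(N, m):
    """Approximation (3.8): one fixed codeword is stable."""
    return math.exp(-N * Q(math.sqrt(N / m)))

def p_all(N, m):
    """Approximation (3.9): all m codewords are stable."""
    return math.exp(-N * m * Q(math.sqrt(N / m)))

def largest_m(N, eps, prob):
    """Largest m <= N whose approximate probability still exceeds 1 - eps."""
    m = 0
    while m + 1 <= N and prob(N, m + 1) > 1.0 - eps:
        m += 1
    return m

for N in (64, 256, 1024):
    print(N,
          largest_m(N, 0.1, p_single), round(N / (2 * math.log(N)), 1),
          largest_m(N, 0.1, p_all), round(N / (4 * math.log(N)), 1))
```

The printed columns compare the finite-$N$ values with the leading-order terms $N/2\log N$ and $N/4\log N$; the lower-order logarithmic corrections in (3.10) and (3.11) account for the remaining gap.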

In Fig. 2 we plot some simulation results for $p$ as a function of $m/N$; we also plot the theoretical graph of $p_G$ versus $m/N$. For $N = 16$ the simulations and the Gaussian approximations differ substantially, but for $N = 256$ the simulations and the Gaussian approximations are almost identical. The simulations were performed by choosing random codewords, updating the $T$ matrix, and then checking if $V(1) \in M$. This process is continued until the number of codewords, $m = O(N)$. This gives a simulation of one sample network and is shown in Fig. 3. From Monte Carlo simulations of several sample networks, the value of $p$ for a given $m$ is easily found.
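A simplified version of such an experiment can be sketched as follows. It is not the algorithm of Fig. 3: each trial here draws a fresh set of $m$ codewords instead of growing one network incrementally, a weighted sum of exactly 0 is assumed to leave a cell unchanged, and the function names, network size, and trial counts are illustrative.

```python
import random

def codeword_is_stable(N, m, rng):
    """Draw m random codewords and test whether the first one is stable."""
    words = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(m)]
    v = words[0]
    overlaps = [sum(a * b for a, b in zip(w, v)) for w in words]
    for j in range(N):
        # U = sum_{i != j} T(j,i) v_i, with T built by the correlation rule
        U = sum(w[j] * (ov - w[j] * v[j]) for w, ov in zip(words, overlaps))
        if U * v[j] < 0:  # cell j would flip
            return False
    return True

def estimate_p(N, m, trials, seed=0):
    """Monte Carlo estimate of the probability that V(1) is stable."""
    rng = random.Random(seed)
    hits = sum(codeword_is_stable(N, m, rng) for _ in range(trials))
    return hits / trials

for m in (1, 2, 4, 8, 16, 32):
    print(m, estimate_p(32, m, 200))
```

The estimate stays near 1 for small $m$ and falls off as $m$ approaches $N$, which is the qualitative shape of the curves compared in Fig. 2.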


Fig. 2. Simulations comparing theoretical Gaussian approximation $p_G$
to Monte Carlo simulations of $p$ for $N = 16$, $N = 64$, and $N = 256$.


To this point our assumption has been that we input a codeword with no errors and then calculate the probability that the state of the system does not change after any update. In AMN we are also interested in recovering stored patterns even when some information about the data is lost. For a fixed AMN, we say that a codeword $V$ has $K$ error correcting ability if all vectors $\tilde V$ within Hamming distance $K$ are correctable by one synchronous update, that is, $V = \Delta(\tilde V)$. We can then evaluate the capacity of the network if we require all codewords to be $K$ error correcting with high probability or if we require a


Fig.  3.  Algorithm  used  for  conducting  simulations  of 


typical codeword to be $K$ error correcting. We define these two capacities as $\hat m(\epsilon, K)$ and $m(\epsilon, K)$, respectively.

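Using the same type of arguments as in the first part of this section, we first find the NIS:

$$u(j,k,\tilde V) = \sum_{i \ne j} \sum_{l \ne k} V_i(l)\, V_j(l)\, \tilde V_i(k)\, V_j(k) = \sum_{l \ne k} u_j(l, \tilde V). \tag{3.12}$$

From (3.2), we note that $u(j,k,\tilde V)$ has the same distribution as $u(j,k)$. Observe that

$$N - 1 - 2K = \sum_{i \ne j} V_i(k)\, V_j(k)\, \tilde V_i(k)\, V_j(k) \tag{3.13}$$

when $h(V(k), \tilde V(k)) = K$ (where $h(x, y)$ is the Hamming distance between $x$ and $y$). We then define the quantities

$$p(K) = \Pr\big(V(k) = \Delta(\tilde V) \mid h(V(k), \tilde V) \le K\big) = \Pr\Big( \bigcap_{j=1}^{N} \{ u(j,k,\tilde V) \ge -N+1+2K \} \Big), \quad 1 \le k \le m \tag{3.14}$$

and

$$\hat p(K) = \Pr\big(V(k) = \Delta(\tilde V(k)) \mid h(V(k), \tilde V(k)) \le K,\ 1 \le k \le m\big) = \Pr\Big( \bigcap_{j=1}^{N} \bigcap_{k=1}^{m} \{ u(j,k,\tilde V) \ge -N+1+2K \} \Big) \tag{3.15}$$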


in analogy to (3.3) and (3.4). By working with the corresponding Gaussian quantities $p_G(K)$ and $\hat p_G(K)$, Theorem A2 and normal approximation theory give the following result.


Theorem 3: Asymptotically, as $N \to \infty$ and for all $K < N/2$,

$$m(\epsilon, K) \sim \frac{(N-2K)^2}{2N\log N} \tag{3.16}$$

and

$$\hat m(\epsilon, K) \sim \frac{(N-2K)^2}{4N\log N}. \tag{3.17}$$

An AMN can therefore be expected to have storage capabilities and an error correcting ability for any $K < N/2$.

Before concluding this section, we discuss some very simple arguments that can be used to show that $m(\epsilon)$ is at least $N/2\log N$. By using subadditivity of probability measures, we can find a lower bound on $m(\epsilon)$. Note that

$$1 - p = \Pr\Big( \bigcup_{j=1}^{N} \{ u(j,k) < -N+1 \} \Big) \le \sum_{j=1}^{N} \Pr\big(u(j,k) < -N+1\big) = N \Pr\big(u(1,k) < -N+1\big). \tag{3.18}$$

Using the De Moivre-Laplace theorem [28], for large $N$, $\Pr(u(1,k) < -N+1) \approx Q(\sqrt{N/m})$. Therefore, as $N \to \infty$, if the number of codewords $m$ is no larger than $N/2\log N$, then $p \to 1$. This lower bound on the probability is relatively tight for $m \le N/2\log N$ but quite poor for values of $m$ that are larger. Using Bonferroni's inequalities [28], we can also upper-bound $m(\epsilon)$ by lower-bounding the error term as

$$1 - p \ge N \Pr\big(u(1,k) < -N+1\big) - \frac{N(N-1)}{2} \Pr\big(u(1,k) < -N+1,\ u(2,k) < -N+1\big). \tag{3.19}$$

Again, this bound is relatively tight for $m \le N/2\log N$ but blows up for values of $m$ that are larger. Fig. 4 shows a plot of Monte Carlo simulations of $1 - p$ versus $m$ when $N = 64$. Approximate upper and lower bounds corresponding to the inequalities in (3.18) and (3.19) and computed from normal approximations are also shown.

Fig. 4. Simple bounds using Gaussian approximations for $1 - p$. Lower bound involves calculation of univariate Gaussian distributions and upper bound involves calculation of bivariate Gaussian distributions. Case considered is $N = 64$ and Monte Carlo simulations are used.

IV. Summary and Discussion

This paper has studied the information storage capacity of associative memory network models, in particular the BAMN. Using normal approximation theory and theorems from exchangeable random variables, we proved some asymptotic results about the capacity of the network. These theoretical results were compared to simulations; for $N > 64$ the theoretical and simulation results compare quite favorably.

In proving the asymptotic capacity of the BAMN, we introduced the Gaussian random variables $\{g(j,k),\ 1 \le j \le m,\ 1 \le k \le N\}$ which have the same first- and second-order moments as $\{u(j,k),\ 1 \le j \le m,\ 1 \le k \le N\}$. We then showed that both the set of events $\{\{g(j,k) < x_N\},\ 1 \le k \le N\}$ and $\{\{g(j,k) < x_N\},\ 1 \le j \le m,\ 1 \le k \le N\}$ satisfied Theorem A2, where $x_N$ is some prespecified set of numbers. By using normal approximation theory, we subsequently showed that $p_G \to p$. It can be similarly shown that $\hat p_G \to \hat p$. Knowing $p$ and $\hat p$ the capacity is easily found by using definitions in Section II and III. A more direct proof would be to show that both the set of events $\{\{u(j,k) < x_N\},\ 1 \le k \le N\}$ and $\{\{u(j,k) < x_N\},\ 1 \le j \le m,\ 1 \le k \le N\}$ satisfy Theorem A2. We conjecture that this is indeed true, and proving it is a topic for further research.

Another direction for further research is to consider more ubiquitous
AMN models. It would be desirable for these models to retain much of
the simple structure of the BAMN, while incorporating such features as
random update operations, spatially varying interactions, and more
complex learning algorithms. In [27], studies of AMN architectures
with these features are described. Using the techniques developed in
this paper, we have found asymptotic bounds for the capacity of some
random update and spatially varying models. Simulations have confirmed
the validity of analytical work. One key result that we have shown is
the following: for a class of homogeneous spatially varying models,
the capacity of the network is directly proportional to the
interconnectivity of the network. Present research is focusing on the
determination of the capacity of various AMN, subject to constraints
on network interconnectivity.


Appendix  I 

This Appendix discusses some basic results concerning exchangeable
random variables and proves Theorem 1 of Section III. Much of the
discussion comes directly from [30]-[33]. Theorem A1 is from [30] and
Theorem A2 is from [32].

The problem of interest involves finding the distribution of the
minimum of a number of random variables; we therefore begin our
discussion with this problem. Let $\{X_i,\ 1 \le i \le N\}$ be
identically distributed random variables with the same marginal
distribution as $X$. To find the distribution of $Z = \min(X_i)$, we
note that

$$\Pr(Z \le z) = 1 - \Pr(X_i > z,\ 1 \le i \le N).$$

If the $\{X_i\}$ are independent, then

$$\Pr(Z \le z) = 1 - (\Pr(X > z))^N.$$
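
As a quick numerical illustration of the independent case, the sketch below compares the closed form $1 - (\Pr(X > z))^N$ with a Monte Carlo estimate. The choice of Uniform(0, 1) variables and the function names are assumptions made only for this example; they do not come from the text.

```python
import random


def min_cdf_independent(tail_prob: float, n: int) -> float:
    """Pr(Z <= z) for Z = min(X_1, ..., X_n) with independent X_i,
    where tail_prob = Pr(X > z)."""
    return 1.0 - tail_prob ** n


def min_cdf_monte_carlo(z: float, n: int, trials: int = 20000, seed: int = 0) -> float:
    """Estimate Pr(Z <= z) by simulation for X_i ~ Uniform(0, 1)
    (an illustrative choice, not taken from the text)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if min(rng.random() for _ in range(n)) <= z:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    z, n = 0.1, 5
    # For Uniform(0, 1), Pr(X > z) = 1 - z.
    print(min_cdf_independent(1.0 - z, n), min_cdf_monte_carlo(z, n))
```

The two numbers should agree up to sampling error; for dependent variables the product form no longer applies, which is what motivates the exchangeability results that follow.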
Let us consider a weaker condition on the $\{X_i\}$. We say that
random variables $\{X_i,\ 1 \le i \le N\}$ are exchangeable if their
joint distribution is invariant under permutations of the random
variables. We also define events $\{C_i,\ 1 \le i \le N\}$ as
exchangeable if, for all choices of indices
$1 \le i_1 < \cdots < i_k \le N$, we have

$$\Pr(C_{i_1}, C_{i_2}, \ldots, C_{i_k}) = \alpha_k, \qquad 1 \le k \le N. \tag{A.1}$$

Note that $\alpha_k$ depends only on $k$ and not on the indices $i_j$.
$\alpha_k$ will be referred to as the $k$th De Finetti constant, with
$\alpha_0 = 1$. Denoting the event $\{X_i > x\}$ by $C_i$, if
$\{X_i,\ 1 \le i \le N\}$ are exchangeable random variables, then
$\{C_i,\ 1 \le i \le N\}$ are exchangeable events. De Finetti [34]
proved an important theorem relating exchangeable random variables to
conditional distributions.


Theorem A1: Let $A(N)$ be the number of $C_i = \{X_i > x\}$ that
occur for exchangeable random variables $X_i$, $1 \le i \le N$. Then

$$\xi = \lim_{N \to \infty} \frac{A(N)}{N}$$

exists almost surely, and if $w$ (a probability measure on the
interval $[0,1]$) is the distribution function of $\xi$, then

$$\alpha_k = \int_0^1 x^k \, w(dx), \qquad k \ge 0.$$
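
Theorem A1 says that the De Finetti constants are the moments of the limiting frequency $\xi$. The sketch below checks this for one hypothetical case: $\xi$ uniform on $[0,1]$, with the events conditionally independent given $\xi$, so that $\alpha_k = \int_0^1 x^k\,dx = 1/(k+1)$. The mixing distribution and the function name are assumptions chosen for illustration.

```python
import random


def estimate_alpha(k: int, trials: int = 40000, seed: int = 1) -> float:
    """Monte Carlo estimate of alpha_k = Pr(C_1, ..., C_k) when the events
    are conditionally i.i.d. given xi, with Pr(C_i | xi) = xi and
    xi ~ Uniform(0, 1) (an assumed mixing distribution)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xi = rng.random()
        # All k events occur; for k = 0 this is vacuously true (alpha_0 = 1).
        if all(rng.random() < xi for _ in range(k)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for k in range(4):
        print(k, estimate_alpha(k), 1.0 / (k + 1))
```

Each estimate should be close to $1/(k+1)$, the $k$th moment of the uniform distribution.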


This result was later extended by several people [35]-[36], giving
the following corollary.

Corollary: Given the above conditions there exists a random variable
$\Lambda$ such that

$$\Pr(C_1, \ldots, C_j \mid \Lambda) = \xi^j$$

where $\xi = \Lambda$ almost surely.

The above results deal with the limiting case as $N \to \infty$, but
we are concerned primarily with finite $N$. Kendall [30] generalized
the above theorems for $N$ finite. As a simple analogy, $N = \infty$
can be viewed as picking marbles from a collection of marbles in an
urn and replacing the picked marbles, whereas $N$ finite can be viewed
as performing the same operation without replacement. If we let
$\delta(\alpha_k) = \alpha_k - \alpha_{k+1}$ and
$\delta^r(\alpha_k) = \delta(\delta^{r-1}(\alpha_k))$, then

(A.2)

If we let $w_j = \binom{N}{j}\,\delta^{N-j}\alpha_j$ for
$0 \le j \le N$, then $w$ is a probability distribution since

$$\delta^r \alpha_{N-r} \ge 0, \qquad 0 \le r \le N \tag{A.3}$$

and

$$\sum_{r=0}^{N} w_r = 1. \tag{A.4}$$

Then we have

(A.5)

for $0 \le k \le N$ and an expression for
$\Pr(C_{i_1}, C_{i_2}, \ldots, C_{i_k} \mid W)$

(A.6)

where $W$ is a random variable with

$$\Pr(W = j) = w_j. \tag{A.7}$$

If we can find the distribution of $W$, we can find the order
statistics of $\{X_i,\ 1 \le i \le N\}$ for $N$ finite. This
distribution is not easy to determine, but Galambos [32] and Kendall
[30] have obtained some results for the limiting case under some mild
restrictions.

Theorem A2: Let $\{X_i,\ 1 \le i \le n\}$ be random variables with
corresponding distribution functions $\{F_i(x),\ 1 \le i \le n\}$.
Let $\{x_n\}$ be a sequence of real numbers such that

$$\lim_{n \to \infty} \sum_{i=1}^{n} F_i(x_n) = b$$

with $0 < b < \infty$. Set

$$\alpha_k(n) = \binom{n}{k}^{-1} \sum \Pr(X_{i_1} \le x_n, \ldots, X_{i_k} \le x_n)$$

where the sum is over all indices
$1 \le i_1 < i_2 < \cdots < i_k \le n$. Assume that for $n \ge n_0$
there exist $\alpha_j(n)$, $n < j \le M_n$, such that the sequence
$\alpha_j(n)$, $j \le M_n$, can be associated with a set of $M_n$
exchangeable events. If $M_n = \infty$ for $n \ge n_0$, or if both
$M_n/n \to \infty$ and $n \to \infty$ with

$$\lim_{n \to \infty} n^2 \alpha_2(n) = b^2,$$

then for any $j$,

$$\lim_{n \to \infty} \Pr(X_j^* < x_n) = 1 - \sum_{k=0}^{j-1} \frac{b^k e^{-b}}{k!}$$

where $X_j^*$ is the $j$th-order statistic.

The above theorem has the following intuitive interpretation. We
construct $M_n$ exchangeable events from some set of $n$ random
variables with $M_n \gg n$. If the events are nearly pairwise
independent, then $l$ of these events are nearly jointly independent
for $2 \le l \le n$. A proof of this theorem presented in [30] is
based on finding the characteristic function of
$K_n = \sum_{i=1}^{n} I(X_i < x_n)$ and approximating this with the
characteristic function of a Poisson random variable with parameter
$b$. The approximation uses Chebyshev's inequality and depends only on
the quantities $|n\alpha_1(n) - b|$ and $|n^2\alpha_2(n) - b^2|$ and
not on $b$.
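
The Poisson limit in Theorem A2 can be illustrated by simulation. The sketch below draws $n$ exchangeable standard Gaussian variables with pairwise correlation $1/(n-1)$, counts how many fall below a threshold $x_n$ chosen so that each does so with probability $b/n$, and compares the empirical count distribution with a Poisson distribution with parameter $b$. The common-factor construction of the correlated variables, the parameter values, and the function names are assumptions made for this example only.

```python
import math
import random
from statistics import NormalDist


def exceedance_counts(n: int, b: float, trials: int, seed: int = 0) -> list:
    """Simulate K_n = number of G_i < x_n, where the G_i are standard
    Gaussians with pairwise correlation 1/(n-1) and Pr(G_i < x_n) = b/n."""
    rng = random.Random(seed)
    rho = 1.0 / (n - 1)
    x_n = NormalDist().inv_cdf(b / n)
    s, c = math.sqrt(rho), math.sqrt(1.0 - rho)
    counts = []
    for _ in range(trials):
        z0 = rng.gauss(0.0, 1.0)  # shared factor inducing the correlation
        k = sum(1 for _ in range(n) if s * z0 + c * rng.gauss(0.0, 1.0) < x_n)
        counts.append(k)
    return counts


def poisson_pmf(k: int, b: float) -> float:
    """Poisson probability mass function with parameter b."""
    return math.exp(-b) * b ** k / math.factorial(k)


if __name__ == "__main__":
    n, b, trials = 200, 1.0, 3000
    counts = exceedance_counts(n, b, trials)
    for k in range(4):
        print(k, counts.count(k) / trials, poisson_pmf(k, b))
```

For moderate $n$ the empirical frequencies should already be close to the Poisson probabilities, since the pairwise correlation $1/(n-1)$ is weak.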

In the problem of interest in Section III, for some $N$ and $m$ we
want to find $\Pr(\min(g(i,j),\ 1 \le i \le N) < -N+1)$ and
$\Pr(\min(g(i,j),\ 1 \le i \le N,\ 1 \le j \le m) < -N+1)$, where the
$\{g(i,j)\}$ are Gaussian exchangeable random variables. We first
normalize these random variables, obtaining random variables
$\{G_n(i,j)\}$ with the following second moments:

$$E[G_n(i,j)\,G_n(k,l)] = \begin{cases}
1, & i = k,\ j = l \\
\dfrac{1}{n-1}, & i \ne k,\ j = l \\
\dfrac{1}{m-1}, & i = k,\ j \ne l \\
\dfrac{1}{(n-1)(m-1)}, & \text{otherwise.}
\end{cases}$$

We want to find the values of $m$ where Theorem A2 can be applied to
evaluate the two problems of Section III. To use Theorem A2 to
evaluate

$$\Pr(\min(g(i,j),\ 1 \le i \le N) < -N+1) = \Pr(\min(G_N(i,j),\ 1 \le i \le N) < x_N),$$

the following two conditions must hold for all $n \ge N$:

1) $\lim_{n \to \infty} n^2 \alpha_2(n) = b^2$;

2) there exists $M_n$ such that $M_n/n \to \infty$ as $n \to \infty$,
with the set $\{G_n(i,j),\ 1 \le i \le n\}$ augmented to
$1 \le i \le M_n$ elements so that all $G_n(i,j)$ are still
exchangeable.

Here $b$ is determined by fixing $N$ and setting $b = NQ(-x_N)$, where
$x_N = -\sqrt{(N-1)/(m-1)}$. The $x_n$ are chosen such that
$n\alpha_1(n) = b$ and therefore $x_n = -Q^{-1}(b/n)$.

Condition 2) is easy to show because we can always add any number of
Gaussian random variables to the set $\{G_n(i,j),\ 1 \le i \le n\}$
with all the random variables having the desired first- and
second-order moments. By the definition of exchangeability this new
augmented set has members that are still exchangeable. For 1) we want
to find the values of $m$ where

$$\lim_{n \to \infty} n^2 \alpha_2(n) - b^2 = 0. \tag{A.8}$$

Let $f_\rho(x,y)$ be the bivariate Gaussian density function with
marginals having mean 0 and variance 1, and correlation $\rho$. Also
let $F_\rho(z)$ be the distribution function of $Z = \max(X,Y)$ where
$X$ and $Y$ are the marginals of the bivariate Gaussian distribution
with correlation $\rho$. Let $\gamma = 1/(n-1)$. Then (A.8) is
equivalent to

$$\lim_{n \to \infty} n^2 F_\gamma(x_n) = b^2. \tag{A.9}$$

Note that

$$F_\gamma(x_n) = \int_{-\infty}^{x_n} dy \int_{-\infty}^{x_n} f_\gamma(x,y)\,dx. \tag{A.10}$$

We first look at the inner integrand. $f_\gamma(x,y)$ can be expanded
as a product of the two marginal density functions ($f(x)$ and
$f(y)$) and a series expansion involving Hermite polynomials
$H_i(x)$ [37]:

$$f_\gamma(x,y) = f(x)f(y) \sum_{i=0}^{\infty} \frac{\gamma^i}{i!} H_i(x) H_i(y) \tag{A.11}$$

where

$$H_i(x) = (-1)^i e^{x^2/2} \frac{d^i}{dx^i}\left[e^{-x^2/2}\right].$$

Equation (A.10) can then be written as

$$F_\gamma(x_n) = \int_{-\infty}^{x_n} dy \int_{-\infty}^{x_n} f(x)f(y) \sum_{i=0}^{\infty} \frac{\gamma^i}{i!} H_i(x) H_i(y)\,dx \tag{A.12}$$

$$= \sum_{i=0}^{\infty} \frac{\gamma^{i+1}}{(i+1)!} \left[f(-x_n) H_i(-x_n)\right]^2 + Q(-x_n)^2. \tag{A.13}$$

Note that $Q(-x_n) = b/n$ and, by setting $x_n = -\sqrt{a \log n}$, we
have

$$f(-x_n) = \frac{1}{\sqrt{2\pi}}\, n^{-a/2}. \tag{A.14}$$

From (A.14), as $n \to \infty$,

$$n^2 F_\gamma(x_n) - b^2 = \frac{1}{2\pi}\, n^{1-a} + \frac{a}{4\pi}\, n^{-a} \log n + o\!\left(\frac{1}{n}\right). \tag{A.15}$$

Equation (A.9) is satisfied when $a > 1$. If we set
$m = N/c \log N$, then $x_N = -\sqrt{c \log N}$. If $c > 1$ then for
all $n > N$, it is easily shown that $x_n = -\sqrt{c_n \log n}$ where
$c_n > 1$. Therefore, (A.9) is satisfied for all $n \ge N$ provided
that $m < N/\log N$. Recall that $b$ and therefore $N$ are independent
of the convergence in distribution of $K_n$ to a Poisson random
variable with parameter $b$. Using these facts, we can therefore state
that for $m < N/\log N$ the number of events $\{G_N(i,j) < x_N\}$,
$1 \le i \le N$, occurring converges in distribution (as
$N \to \infty$) to a Poisson distribution with parameter $NQ(-x_N)$.

To use Theorem A2 to evaluate

$$\Pr(\min(g(i,j),\ 1 \le i \le N,\ 1 \le j \le m) < -N+1) = \Pr(\min(G_N(i,j),\ 1 \le i \le N,\ 1 \le j \le m) < x_N),$$

we again need two conditions analogous to those just obtained for all
$n \ge N$. For Gaussian exchangeable random variables the second
condition is trivial to show, just as in the previous case. For the
first condition we require that

$$\lim_{n \to \infty} n^2 m^2 \alpha_2(n) = b^2. \tag{A.16}$$

For this case $x_n = -Q^{-1}(b/nm)$ and $b = NmQ(-x_N)$. We again use
the Hermite polynomial expansion of the bivariate density to show that

$$\alpha_2(n) - Q(-x_n)^2 = \frac{1}{mn-1} \sum_{i=0}^{\infty} \frac{(n-1)\gamma_1^{i+1} + (m-1)\gamma_2^{i+1} + (nm-n-m+1)\gamma_3^{i+1}}{(i+1)!} \left[f(-x_n) H_i(-x_n)\right]^2 \tag{A.17}$$

where $\gamma_1 = 1/(n-1)$, $\gamma_2 = 1/(m-1)$, and
$\gamma_3 = 1/((n-1)(m-1))$. From (A.14) and (A.16), as
$n \to \infty$ with $m = n/(a \log n)$,

$$n^2 m^2 \alpha_2(n) - b^2 = \frac{3}{2\pi} \frac{n^{2-a}}{a \log n} + \frac{a \log n}{4\pi} \left[\frac{n^{1-a}}{a \log n} + n^{1-a} + n^{-a}\right] + O\!\left(n^{-a} (a \log n)^3\right). \tag{A.18}$$

Equation (A.16) is satisfied when $a \ge 2$. Using the same arguments
as in the first case, we can state that for $m \le N/2\log N$ the
number of events $\{G_N(i,j) < x_N\}$, $1 \le i \le N$,
$1 \le j \le m$, occurring converges in distribution (as
$N \to \infty$) to a Poisson distribution with parameter
$NmQ(-x_N)$. We have thus proved the following theorem, which is
identical to the theorem in Section III.

Theorem A3: 1) If

$$m < \frac{N}{\log N}, \qquad X_N = \sum_{k=1}^{N} I(g(j,k) < -N)$$

and $Y_N$ is a Poisson random variable with parameter
$\lambda_1 = NQ(\sqrt{N/m})$, then $X_N \to Y_N$ in distribution.

2) If

$$m < \frac{N}{2 \log N}, \qquad X_N = \sum_{j=1}^{m} \sum_{k=1}^{N} I(g(j,k) < -N)$$

and $Y_N$ is a Poisson random variable with parameter
$\lambda_2 = NmQ(\sqrt{N/m})$, then $X_N \to Y_N$ in distribution.

Appendix  II 


Given the random variables $\{u(j,k),\ 1 \le j \le N\}$ defined in
Section III, we want to find the following probability:


I  ^ 

p  =  Pr  Pi  “( 2' ^  ^  I 

\y-i 


l<A;<m  (B.l) 


where $\{u(j,k)\}$ are binomial random variables with
$E(u(j,k)) = 0$ and

$$E(u(j,k)\,u(l,k)) = \begin{cases}
(N-1)(m-1), & j = l \\
m-1, & j \ne l.
\end{cases}$$

From (3.2) we note that each $u(j,k)$ is the sum of $m-1$ independent
identically distributed (i.i.d.) random variables. Let us normalize
these random variables, defining

$$U(j) = \frac{u(j,k)}{\sqrt{E(u(j,k)^2)}}, \qquad 1 \le j \le N.$$


For large $m$ we know that $U(j)$ converges to a standard Gaussian
random variable (mean 0 and variance 1) by the central limit theorem.
This theorem is also applicable to multivariate distributions; the
$N$-variate joint distribution of $\{U(j),\ 1 \le j \le N\}$ thus
converges to the multivariate Gaussian distribution $N(0, R)$ where

$$R(i,j) = \begin{cases}
1, & i = j \\
\dfrac{1}{N-1}, & i \ne j
\end{cases}$$

as $m$ grows large; see [29], [38], [39].

Under the assumption that the $\{U(j)\}$ are Gaussian random
variables, the results of Appendix I show that

(B.2)

where $\beta = \sqrt{(N-1)/(m-1)}$ and $x = (x_1, \ldots, x_N)$. Since
the Gaussian assumption is only true asymptotically, we are led to the
question of whether or not $p \to p_G$. In this Appendix we show that
$p \to p_G$ for large $m$ by comparing the distribution of
$\{U(j),\ 1 \le j \le N\}$ with its normal approximation. We first
state some necessary results from normal approximation theory.


A. Error Terms in the Normal Approximation to i.i.d. Random Vectors


Let us first look at the problem of approximating the distribution of
sums of i.i.d. random variables by the normal distribution in
$\mathbb{R}^1$. We want to find how "close" the normal distribution is
to the distribution of a normalized finite sum of i.i.d. random
variables. Let $\{u_i,\ 1 \le i \le m\}$ be $m$ i.i.d. random
variables with $Eu_i = 0$, $Eu_i^2 = 1$, and let

$$U_m = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} u_i.$$

Then the difference between $F_m(x)$, the distribution of $U_m$, and
the standard normal distribution $\Phi(x)$ is given by

$$F_m(x) - \Phi(x) = \phi(x) \sum_{i=1}^{\infty} Q_i(x)\, m^{-i/2} \tag{B.3}$$

where $\phi(x)$ is the probability density function of the standard
normal distribution and $Q_i(x)$ are polynomials derived from the
standard Hermite polynomials [37]:

H,{x) 


(-1)'  *c, 

Q,(a;) - - - 

Letting $f_m(x)$ be the probability density function of $F_m(x)$, we
have

$$c_i = (-1)^i \int H_i(x) f_m(x)\,dx.$$

We would like to find the order of magnitude of the error terms in
(B.3). In particular, we are interested in the size of
$F_m(x) - \Phi(x) - \phi(x) Q_1(x)/\sqrt{m}$.

For  lattice  (i.e.,  discrete)  distributions  where  all  sample  values 
can  be  expressed  in  the  form  a  +  ih  where  a  and  h  are  constants 
and  i  is  an  integer,  the  following  result  holds  [40],  [41]: 


Ux)-«I.(x)»4>(x) 


Ci(x)  ^  hS^{x{m/h) 


/in 


+  o\ 


(B.4) 


where $S_1(\cdot)$ is a correction term arising from the
discontinuities of the distribution function and $h$ is the size of
the lattice. The correction term is the periodic function

$$S_1(x) = x \bmod 1 - 0.5. \tag{B.5}$$
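
To see why a lattice correction of order $m^{-1/2}$ is needed, the sketch below computes the exact sup-norm distance between the distribution of a normalized sum of $m$ i.i.d. $\pm 1$ variables and the standard normal distribution. The $\pm 1$ summands are an assumed example of a lattice distribution, and the function names are hypothetical. Because the distribution function jumps at each lattice point, both the left and right limits are compared with $\Phi$.

```python
import math


def Phi(x: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sup_error(m: int) -> float:
    """sup_x |F_m(x) - Phi(x)| for U_m = m^{-1/2} * (sum of m i.i.d. +-1 variables)."""
    denom = 2 ** m
    cdf_left = 0.0  # value of F_m just below the current lattice point
    worst = 0.0
    for k in range(m + 1):
        x = (2 * k - m) / math.sqrt(m)  # lattice point; spacing is 2/sqrt(m)
        cdf_right = cdf_left + math.comb(m, k) / denom
        worst = max(worst, abs(cdf_left - Phi(x)), abs(cdf_right - Phi(x)))
        cdf_left = cdf_right
    return worst


if __name__ == "__main__":
    for m in (100, 400, 1600):
        err = sup_error(m)
        print(m, err, err * math.sqrt(m))
```

The scaled error `err * sqrt(m)` stays roughly constant as $m$ grows, which is the $m^{-1/2}$ behavior that the discontinuity term in (B.4) accounts for.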

For distributions in $\mathbb{R}^N$ an approximation analogous to
(B.4) is slightly more complicated. Now let $\{u_i\}$ be random
vectors with $Eu_i = 0$ and $Eu_i u_i^T = R$. Before stating a theorem
from [38] we present some definitions. Let $f_m(t)$ be the
characteristic function of the distribution $F_m(x)$. The
characteristic function can be expressed in the following way:

$$f(t) = \exp\left(\sum_{|\nu| \ge 1} \chi_\nu \frac{(it)^\nu}{\nu!}\right) \tag{B.6}$$

where $\nu$ is a nonnegative integer vector,
$\nu! = \prod_{i=1}^{N} \nu_i!$,
$(it)^\nu = \prod_{i=1}^{N} (it_i)^{\nu_i}$, and
$|\nu| = \sum_{i=1}^{N} \nu_i$. The coefficients $\chi_\nu$ are called
the semi-invariants of $f_m$. For a Gaussian random vector all
semi-invariants for $|\nu| > 2$ are 0. Also define


$$D^\nu f(x) = \frac{\partial^{|\nu|} f}{\partial x_1^{\nu_1}\, \partial x_2^{\nu_2} \cdots \partial x_N^{\nu_N}}(x) \tag{B.7}$$


and 


OJ(x) 


The following theorem from [38] can now be stated.

Theorem B1: Let $F_m(x)$ be defined as above. Then
sup 

x^R'' 

I  ^ 

=  nr‘  -  sup  L 

-  E  +o(m  (B.8) 

.1-3 

Equation (B.8) is similar to (B.4) in that the first term on the right
side of the equation is an error term due to the lattice distribution,
and the second term describes the standard $O(m^{-1/2})$ error that
occurs for all distributions.

B. Using Normal Approximation Theory to Show that $p \to p_G$

We now show that $p \to p_G$. For $U(j)$ we have
$h = h_j = 1/\sqrt{N-1}$. We define the $\nu$th moment as


Using  the  fact  that  the  random  variables  we  are  looking  at  are 
exchangeable  and  using  (B.10)-(B.13),  we  have  that 


\P-  Pg\^ 


2(m  —  l)v 


N(N-l)iN-2) 


il  +  {N-2)p)p-2lip 
(l  +  ((V-l)p)(l-p) 
(l  +  2p)p 


(l  +  (fV-l)p)(l  +  p) 


{N-2)<t>{p,l},p) 


+  o(m-'/2).  (B.15) 

Substituting  values  of  m  and  P  and  simplifying,  we  have 


^»  ”  //  ■  ■  ■  /^Vm(  dxj  •  •  •  dx V  (B.9)  - 

-  -  alogN 

where  .x’=n,'iix’'  and  f„(x)  is  the  Af-variate  joint  density  If*  Pcl^y  2^  ^ 

function  of  (U(j),  1<  j^N}.  For  the  case  where  x,  =  0  for 

|v|=l  it  is  easily  shown  that  p,  =x.  when  |v!  =  3.  The  semi-  wii/ji-^ 

invariants  we  need  to  use  in  {B.8)  are  taken  from  the  following  ^ 

third  moment  values: 

/  0,  i  =  j  =  k 

I  ^  ft  rl _ 1  y'N  ^  ^  »r 


(alogA/)^^^  /^"/^’■‘^“''^’SalogAf 

+  + --  .  (8.16)' 
2rr  ' 


EU{i)U(j)U(k) 


2(^-2) 

2 

/(N-iy(m-l) 


/  ”  jNtk 


i  *  j,  i  *  k,  j*  k. 


When  1/2  <  a  «*:  Af/log  N,  [p  -  pd  -►  0  as  N  -*oo.  Asymptoti¬ 
cally,  we  are  only  interested  in  the  area  around  a  =  2.  For  a  <2. 
Pc  -►  1  and  for  a  S;  2,  Pc  -♦  0.  Therefore,  we  can  state  that 


(B.17) 


(B.IO) 


as  N  grows  large. 


We  also  observe  that 


Z)^<I>(x)  -  /  '  •••  /  '  7  '  ■■/  *"'(>( “i>--->“y-i-^>.«;-.i.-  -.“v)<f“i  ■■■du  idu  I 

ao  •'-ao*'— ao  m 


where  <t>(x,^,-  •  -  is  the  joint  Gaussian  density  function  asso¬ 
ciated  with  random  variables  (U,  ,  l^j^k).  Then  if  we  let 
v,^=  {1, 1,1,0,-  •  -  ,0}  it  is  easily  shown  that 


Z)‘''<I>(x)  ^^(xi.xj.xj). 
Letting  {2,1,0,- ■ -,0}, 

.  M  (f +{^-2)p)jCi-2xjp 

1“  *('‘)'^“(iT(F.irp)(i;pi 


(B.12) 


(l  +  2p)p 


<».(x,,xj,x,)  (B.13) 


(l+(iV-l)p)(l-fp)^t'3"'  "  '  ' 

where  p  =  £(x,x2)  “l/(Af-l).  For  our  problem  Xj—fi  and 
m  —  N/a  log  N.  From  (B.8)  we  have  that 


+  ^-1/2  +o(m-‘/^).  (B.14) 

|.|-3  *- 


(B.ll) 
Stochastic Models for Interacting Systems


Anthony  Kuh 


ABSTRACT 


This dissertation is concerned with the behavior of various systems of
interacting components. Interacting systems have been used as models
in Economics, Biology, Physics, Computer Science, and Engineering. We
focus on discrete time systems, where components take on values from a
common state space (usually a binary state space). Components, or
cells, are updated at discrete time instants, with the updates on a
cell depending on the previous value of some specified set of cells.
The components of these interacting systems can be viewed as simple
processing units, and the whole system can be viewed as a parallel
computing machine. One of the main goals of the thesis is to find
measures for the computational capabilities of different types of
interacting systems.

We first study stochastic models for local spatially interacting
systems (LSIS). A candidate mathematical model for local interacting
systems is the class of Markov Random Fields (MRF). After discussing
existing MRF models for LSIS, we introduce a discrete time synchronous
model called the Completely Causal Markov Model (CCMM). Techniques are
developed to analyze the behavior of the CCMM and assess the
computational capabilities of various models.

The class of Hopfield Associative Memory Networks (AMN) is discussed.
Like LSIS, AMN consist of a number of simple interacting components,
but unlike LSIS, AMN components usually have a high degree of
interconnectivity. AMN models have been used to model neural networks.
They have powerful computational capabilities, and they have been used
to solve complex combinatorial optimization problems. We focus on
assessing the computational capabilities of AMN models. The asymptotic
information storage capacity of a simple AMN model is derived, using
results from exchangeability and normal approximation theory. Other
models for AMN are also developed along with an evaluation of the
computational capabilities of these networks.

Finally, we present an application for some models of LSIS, dealing
with detecting faults that occur on semiconductor memory chips. LSIS
are used to model Pattern Sensitive Faults (PSF) which occur when a
read or write operation is faulty for some particular storage location
when the memory cells exhibit a certain pattern.
Abstract

This dissertation considers some aspects of the behavior of several
neural networks. We examine the original Hopfield Associative Memory
(HAM) and derive a lower bound on the number of spurious minima when
the stored memories are orthogonal. Two locally interconnected
variations of the basic HAM network are proposed in which the maximum
distance between two neurons that can be connected is upper-bounded by
$B$. We show that for such locally interconnected networks containing
$N$ neurons, if $B/N \to 0$ as $N \to \infty$ then the capacity of the
network is determined by $B$ and is independent of $N$.

A macroscopic analysis technique first proposed by Amari for networks
with random, nonsymmetric connection weights is modified to show that
HAMs must have either one or two macroscopic stable states. The
original analysis is also extended to networks of McCulloch-Pitts
neurons with symmetric connections. The analysis and simulations show
that the macroscopic behavior of networks with symmetric and
nonsymmetric connections is qualitatively similar: the network either
has one stable equilibrium, has two stable equilibria, or oscillates
between two states.

We propose a new class of neural networks which are derived from the
trellis graph representation of a convolutional code. Such a trellis
network can be viewed as a collection of winner-take-all networks that
are interconnected to reflect the structure of the trellis graph. We
demonstrate by simulations that a trellis network with suitably
defined connection weights can decode convolutional coded signals with
added errors for a range of low error rates. We show that each of the
subnetworks is stable for all choices of parameters. For a restricted
set of parameters and a monotonically increasing gain with
sufficiently large derivative, we show that the entire network is
equiasymptotically stable.

We  propose  a  modified  form  of  the  trellis  network  which  can  tolerate 
errors  in  the  input  and  replace  failed  neurons.  Spare  neurons  are  added 
to  each  of  the  subnetworks  so  that  if  a  neuron  fails,  it  can  be  replaced  by 
any  spare  neuron  in  the  same  subnetwork.  Replacement  occurs  without  any 
supervision  through  modification  of  the  connections  between  the  subnetworks 
and  has  been  verified  by  simulations. 